UTF8 in filenames

Thorvald Natvig

unread,

Aug 10, 2009, 8:03:05 AM8/10/09

to msysGit

I've taken a look at Issue 80, and a few questions have come up.

First, the issue tracker indicates that UTF8 isn't used consistently
even on Linux. Is that just because people have non-utf8 locales, or is
it the case that this isn't even compatible across different linux machines?

Second, coding style. I've inlined a small patch below, which adds two
generic macros to turn a utf8 char * var into a wchar_t one, using Win32
functions only. It has a few issues:

Is the use of such "code-expanding" macros allowed in git? These macros
declare new variables and contain multiple statements, meaning they
can't be used to, for example, do
if (strcmp(WCHAR_TO_UTF8_VAR(str_a), str_b)) ...
This is bad form, I know, but the alternative is to write out the
contents of those macros every time we do a wchar_t to utf8 conversion
(or back), and that's a lot of places that can contain off-by-one bugs
that will go missed due to there being lots of code replication.
An alternative is to make it a proper functions which will malloc() the
necesarry memory, but this means having to add free(path_wide) at every
exit point from a function that uses this.
Yet another alternative is to use a "sufficiently large" stack buffer,
but as "full length" paths (only available in unicode file functions)
can be up to 32767 characters, this means 64k memory for the wchar_t
version, and 96k for utf8 strings (as each wchar_t is up to 3 bytes in
utf8). That's a lot of stack memory, especially if we ever recurse.
Which alternatives would be acceptable?

This also uses C99 features, such as 'variable declarations after
statements' and 'variable-length stack arrays'. This means it compiles
without problems on GCC, but will not compile on, e.g., MSVC. Is this an
issue?

diff --git a/compat/win32.h b/compat/win32.h
index c26384e..a25859d 100644
--- a/compat/win32.h
+++ b/compat/win32.h
@@ -1,6 +1,16 @@
/* common Win32 functions for MinGW and Cygwin */
#include <windows.h>

+#define UTF8_TO_WCHAR_VAR(var) int var##_len = var ? (strlen(var)+1) : 0; \
+ int var##_wlen = var ? MultiByteToWideChar(CP_UTF8, 0, var,
var##_len, NULL, 0) : 0; \
+ wchar_t var##_w[var##_wlen]; \
+ MultiByteToWideChar(CP_UTF8, 0, var, var##_len, var##_w, var##_wlen);
+
+#define WCHAR_TO_UTF8_VAR(var) int var##_wlen = var ? (wcslen(var)+1) :
0; \
+ int var##_len = var ? WideCharToMultiByte(CP_UTF8, 0, var,
var##_wlen, NULL, 0, NULL, NULL) : 0; \
+ char var##_u[var##_len]; \
+ WideCharToMultiByte(CP_UTF8, 0, var, var##_wlen, var##_u,
var##_len, NULL, NULL);
+
static inline int file_attr_to_st_mode (DWORD attr)
{
int fMode = S_IREAD;
@@ -15,7 +25,15 @@ static inline int file_attr_to_st_mode (DWORD attr)

static inline int get_file_attr(const char *fname,
WIN32_FILE_ATTRIBUTE_DATA *fdata)
{
- if (GetFileAttributesExA(fname, GetFileExInfoStandard, fdata))
+ printf("WAS %s\n", fname);
+ UTF8_TO_WCHAR_VAR(fname);
+ printf("BECAME %ls %d %d\n", fname_w, fname_len, fname_wlen);
+
+ wchar_t *ptr = fname_w;
+ WCHAR_TO_UTF8_VAR(ptr);
+ printf("RETURN TO %s %d %d\n", ptr_u, ptr_len, ptr_wlen);
+
+ if (GetFileAttributesExW(fname_w, GetFileExInfoStandard, fdata))
return 0;

switch (GetLastError()) {

Johannes Schindelin

unread,

Aug 10, 2009, 9:36:24 AM8/10/09

to Thorvald Natvig, msysGit

Hi,

On Mon, 10 Aug 2009, Thorvald Natvig wrote:

> I've taken a look at Issue 80, and a few questions have come up.
>
> First, the issue tracker indicates that UTF8 isn't used consistently
> even on Linux. Is that just because people have non-utf8 locales, or is
> it the case that this isn't even compatible across different linux
> machines?

It is because the information which encoding is used for filenames is not
something easily obtained. So it is basically the choice of the project
which encoding it uses.

Unfortunately, Microsoft chose something as non-standard as you can get:
cp1252 (at least as far as I can tell).

The de-facto standard for international projects is UTF-8 these days, as
it comes at no cost for those who started in ASCII.

> Second, coding style. I've inlined a small patch below, which adds two
> generic macros to turn a utf8 char * var into a wchar_t one, using Win32
> functions only. It has a few issues:
>
> Is the use of such "code-expanding" macros allowed in git? These macros
> declare new variables and contain multiple statements, meaning they can't be
> used to, for example, do
> if (strcmp(WCHAR_TO_UTF8_VAR(str_a), str_b)) ...

We usually make "static inline" functions from such beasts, I think.

> This is bad form, I know, but the alternative is to write out the
> contents of those macros every time we do a wchar_t to utf8 conversion
> (or back), and that's a lot of places that can contain off-by-one bugs
> that will go missed due to there being lots of code replication.
>
> An alternative is to make it a proper functions which will malloc() the
> necesarry memory, but this means having to add free(path_wide) at every
> exit point from a function that uses this.
>
> Yet another alternative is to use a "sufficiently large" stack buffer,
> but as "full length" paths (only available in unicode file functions)
> can be up to 32767 characters, this means 64k memory for the wchar_t
> version, and 96k for utf8 strings (as each wchar_t is up to 3 bytes in
> utf8). That's a lot of stack memory, especially if we ever recurse.
> Which alternatives would be acceptable?

We actually have a precendence for something like that: sha1_to_hex(). It
returns one of 4 static buffers (which are all large enough), and rotates
the buffers around. So you need to call sha1_to_hex() 4 times in order to
get the same return value.

> This also uses C99 features, such as 'variable declarations after
> statements' and 'variable-length stack arrays'. This means it compiles
> without problems on GCC, but will not compile on, e.g., MSVC. Is this an
> issue?

We try very much to avoid that in git.git, even if it is not strictly
necessary for MinGW32, as we use GCC there. But you might have heard that
somebody asked for MSVC support recently; this would be much harder then.
And I think it should be easy to avoid the dependence on a C99
compiler.

> diff --git a/compat/win32.h b/compat/win32.h
> index c26384e..a25859d 100644
> --- a/compat/win32.h
> +++ b/compat/win32.h
> @@ -1,6 +1,16 @@
> /* common Win32 functions for MinGW and Cygwin */
> #include <windows.h>
>
> +#define UTF8_TO_WCHAR_VAR(var) int var##_len = var ? (strlen(var)+1) : 0; \
> + int var##_wlen = var ? MultiByteToWideChar(CP_UTF8, 0, var, var##_len, NULL, 0) : 0; \
> + wchar_t var##_w[var##_wlen]; \
> + MultiByteToWideChar(CP_UTF8, 0, var, var##_len, var##_w, var##_wlen);
> +
> +#define WCHAR_TO_UTF8_VAR(var) int var##_wlen = var ? (wcslen(var)+1) : 0; \
> + int var##_len = var ? WideCharToMultiByte(CP_UTF8, 0, var, var##_wlen, NULL, 0, NULL, NULL) : 0; \
> + char var##_u[var##_len]; \
> + WideCharToMultiByte(CP_UTF8, 0, var, var##_wlen, var##_u, var##_len, NULL, NULL);
> +

I agree that we need to use static buffers for such conversions, as we're
not guaranteed that the buffers are large enough? There are UTF-8 strings
which are shorter than their wide char equivalents (think ASCII-only), but
there are also UTF-8 strings which are longer (think about the 0xffff
codeword; even if it fits into 16 bits, due to the prefix coding of UTF-8,
the number of required bytes is larger than 2).

But we do not need to use heap space: how about something like

static wchar_t *utf8_to_wchar(const char *filename)
{
static char buf[4][(PATH_MAX + 1) * 2];
static int counter;
int result = MultiByteToWideChar(CP_UTF8, 0,
filename, strlen(filename) + 1,
buf[counter], ARRAY_SIZE(buf[counter]));
if (result >= 0)
return buf[counter++];
error("Could not convert utf8-name to widechars: %s",
result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
"flags/parameter error");
return NULL;
}

static char *wchar_to_utf8(const wchar_t *filename)
{
static wchar_t buf[4][(PATH_MAX + 1) * 2];
static int counter;
int result = WideCharToMultiByte(CP_UTF8, 0,
filename, wcslen(filename) + 1,
buf[counter], sizeof(buf[counter]));
if (result >= 0)
return buf[counter++];
error("Could not convert widechars to utf8-name: %s",
result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
"flags/parameter error");
return NULL;
}

But maybe the rotating buffer is overkill?

> @@ -15,7 +25,15 @@ static inline int file_attr_to_st_mode (DWORD attr)
>
> static inline int get_file_attr(const char *fname,
> WIN32_FILE_ATTRIBUTE_DATA *fdata)
> {
> - if (GetFileAttributesExA(fname, GetFileExInfoStandard, fdata))
> + printf("WAS %s\n", fname);
> + UTF8_TO_WCHAR_VAR(fname);
> + printf("BECAME %ls %d %d\n", fname_w, fname_len, fname_wlen);
> +
> + wchar_t *ptr = fname_w;
> + WCHAR_TO_UTF8_VAR(ptr);
> + printf("RETURN TO %s %d %d\n", ptr_u, ptr_len, ptr_wlen);
> +
> + if (GetFileAttributesExW(fname_w, GetFileExInfoStandard, fdata))
> return 0;

I'd actually put the debug messages into the newly-created functions.

Ciao,
Dscho

Thorvald Natvig

unread,

Aug 10, 2009, 10:46:44 AM8/10/09

to Johannes Schindelin, msysGit

>> This is bad form, I know, but the alternative is to write out the
>> contents of those macros every time we do a wchar_t to utf8 conversion
>> (or back), and that's a lot of places that can contain off-by-one bugs
>> that will go missed due to there being lots of code replication.
>>
>> An alternative is to make it a proper functions which will malloc() the
>> necesarry memory, but this means having to add free(path_wide) at every
>> exit point from a function that uses this.
>>
>> Yet another alternative is to use a "sufficiently large" stack buffer,
>> but as "full length" paths (only available in unicode file functions)
>> can be up to 32767 characters, this means 64k memory for the wchar_t
>> version, and 96k for utf8 strings (as each wchar_t is up to 3 bytes in
>> utf8). That's a lot of stack memory, especially if we ever recurse.
>> Which alternatives would be acceptable?
>>
>
> We actually have a precendence for something like that: sha1_to_hex(). It
> returns one of 4 static buffers (which are all large enough), and rotates
> the buffers around. So you need to call sha1_to_hex() 4 times in order to
> get the same return value.
>

I suppose we could use something similar; it would waste half a MB of
memory for the static buffers, but that may be "cheap enough" these days.

>> +#define UTF8_TO_WCHAR_VAR(var) int var##_len = var ? (strlen(var)+1) : 0; \
>> + int var##_wlen = var ? MultiByteToWideChar(CP_UTF8, 0, var, var##_len, NULL, 0) : 0; \
>> + wchar_t var##_w[var##_wlen]; \
>> + MultiByteToWideChar(CP_UTF8, 0, var, var##_len, var##_w, var##_wlen);
>> +
>>
>>

> I agree that we need to use static buffers for such conversions, as we're
> not guaranteed that the buffers are large enough? There are UTF-8 strings
> which are shorter than their wide char equivalents (think ASCII-only), but
> there are also UTF-8 strings which are longer (think about the 0xffff
> codeword; even if it fits into 16 bits, due to the prefix coding of UTF-8,
> the number of required bytes is larger than 2).
>

The above actually first figures out how many characters are needed (the
int var##_wlen line), then allocates that many characters on the stack
(using C99 vararrays).

Could be. It also might be just as easy to just declare

#define UTF8_TO_WCHAR(dest, src) MultiByteToWideChar(CP_UTF8, 0, src,
-1, dest, MAX_PATH)

and then in each function
wchar_t path_w[MAX_PATH];
UTF8_TO_WCHAR(path_w, path);
.. or simply write the full MultiByteToWideChar call each time. Hm. Come
to think of it, that probably is the easiest.

That brings up another point though; MAX_PATH is 255 on Windows, and
using "normal" APIs (including unicode), that's the maximum length of
any path. However, if you use absolute paths, prefixed with \\?\, you
can use up to 32767 characters. This works in the API, but it will
confuse explorer, the command shell and most programs that don't
explicitly support long paths, which is pretty much all of them. Perhaps
leave it at 255 for now and let somebody that cares about Very Long
paths worry about it later?

Best Regards,
Thorvald

Johannes Schindelin

unread,

Aug 10, 2009, 11:17:23 AM8/10/09

to Thorvald Natvig, msysGit

Hi,

On Mon, 10 Aug 2009, Thorvald Natvig wrote:

> > > This is bad form, I know, but the alternative is to write out the
> > > contents of those macros every time we do a wchar_t to utf8
> > > conversion (or back), and that's a lot of places that can contain
> > > off-by-one bugs that will go missed due to there being lots of code
> > > replication.
> > >
> > > An alternative is to make it a proper functions which will malloc()
> > > the necesarry memory, but this means having to add free(path_wide)
> > > at every exit point from a function that uses this.
> > >
> > > Yet another alternative is to use a "sufficiently large" stack
> > > buffer, but as "full length" paths (only available in unicode file
> > > functions) can be up to 32767 characters, this means 64k memory for
> > > the wchar_t version, and 96k for utf8 strings (as each wchar_t is up
> > > to 3 bytes in utf8). That's a lot of stack memory, especially if we
> > > ever recurse. Which alternatives would be acceptable?
> > >
> >
> > We actually have a precendence for something like that: sha1_to_hex().
> > It returns one of 4 static buffers (which are all large enough), and
> > rotates the buffers around. So you need to call sha1_to_hex() 4 times
> > in order to get the same return value.
> >
> I suppose we could use something similar; it would waste half a MB of
> memory for the static buffers, but that may be "cheap enough" these
> days.

Even if you use 32767 as MAX_PATH, I end up with 8*32kB = 256kB. But I
actually wanted to stay with 255, which ends up being .25*8kB = 2kB.

But the way you're using the buffers, I think you do not even need 4
buffers, but only one.

> > > +#define UTF8_TO_WCHAR_VAR(var) int var##_len = var ? (strlen(var)+1) : 0;
> > > \
> > > + int var##_wlen = var ? MultiByteToWideChar(CP_UTF8, 0, var,
> > > var##_len, NULL, 0) : 0; \
> > > + wchar_t var##_w[var##_wlen]; \
> > > + MultiByteToWideChar(CP_UTF8, 0, var, var##_len, var##_w, var##_wlen);
> > > +
> > >
> > >
> > I agree that we need to use static buffers for such conversions, as we're
> > not guaranteed that the buffers are large enough? There are UTF-8 strings
> > which are shorter than their wide char equivalents (think ASCII-only), but
> > there are also UTF-8 strings which are longer (think about the 0xffff
> > codeword; even if it fits into 16 bits, due to the prefix coding of UTF-8,
> > the number of required bytes is larger than 2).
> >
> The above actually first figures out how many characters are needed (the int
> var##_wlen line), then allocates that many characters on the stack (using C99
> vararrays).

Yes, that is something I want to avoid.

The problem is that you violate the DRY principle (Don't Repeat Yourself),
and the consequence is all too easy to see: Assuming that the next patch
would be the introduction of a config variable to turn off the UTF-8
assumption, you'd need to change _all_ of the calling sites.

> That brings up another point though; MAX_PATH is 255 on Windows, and
> using "normal" APIs (including unicode), that's the maximum length of
> any path. However, if you use absolute paths, prefixed with \\?\, you
> can use up to 32767 characters. This works in the API, but it will
> confuse explorer, the command shell and most programs that don't
> explicitly support long paths, which is pretty much all of them. Perhaps
> leave it at 255 for now and let somebody that cares about Very Long
> paths worry about it later?

As I said, I'd stay with 255, for all the reasons you listed.

Ciao,
Dscho

Thorvald Natvig

unread,

Aug 10, 2009, 6:06:00 PM8/10/09

to Johannes Schindelin, msysGit

> Even if you use 32767 as MAX_PATH, I end up with 8*32kB = 256kB. But I
> actually wanted to stay with 255, which ends up being .25*8kB = 2kB.
>
> But the way you're using the buffers, I think you do not even need 4
> buffers, but only one.
>

I know at need at least 2 (for both arguments to link, symlink, rename
etc), so just to be cautious I've used 4.

> The problem is that you violate the DRY principle (Don't Repeat Yourself),
> and the consequence is all too easy to see: Assuming that the next patch
> would be the introduction of a config variable to turn off the UTF-8
> assumption, you'd need to change _all_ of the calling sites.
>

I made it a buffer-rotating static inline functions. As it turns out,
all cases where you want to convert wchar_t back to utf8 are cases where
the return buffer is already allocated somewhere, so I only have a
direct version for that.

Patch attached. Commits, checkouts, removals, even symlinks work.

To get this done, I had to override pretty much every file function that
takes a char * pointer, as all of those need to deal with utf8->wchar
conversion. It was extra fun when the basedir contained utf8 that
couldn't be represented in ANSI, as that meant the chdir() to project
root failed. I also had to completely replace mingw_readdir, as the
_wdiropen in mingw doesn't append the '/*' to the dir name in the entry.
Since I had to override diropen anyway, I simply made a new handler for it.

Note that neither msys' bash or regular cmd can show unicode characters
for files or directories. The folder above became 'Folder ????'.
However, with cmd, you can "cd Folder-<TAB>", and it works, whereas bash
reports "No such file or directory.". I also haven't looked at adding
wide-char argument interpretation; so you can't "git add
<filename_with_unicode>" directly.

This will probably be attached with tab-expansion and flowed format
again, but I figured it is still easier to comment on it if it is
inline. Hmm. Actually, due to the number of changes, espeically to the
readdir() things, the patch makes it look much more gruesome than it is.
Do you want me to push it as well, so you can see the "finished" result?

Best Regards,
Thorvald

From 7ae0412f10c15327731200328db2bb8ca75b6a89 Mon Sep 17 00:00:00 2001
From: Thorvald Natvig <sli...@users.sourceforge.net>
Date: Mon, 10 Aug 2009 23:44:10 +0200
Subject: [PATCH 2/2] UTF8 Support

---
compat/mingw.c | 339
++++++++++++++++++++++++++++++++++++++------------------
compat/mingw.h | 41 ++++---
2 files changed, 257 insertions(+), 123 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 20f370b..800c782 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -2,6 +2,7 @@
#include "win32.h"
#include <conio.h>
#include <winioctl.h>
+#include <wchar.h>
#include "../strbuf.h"

unsigned int _CRT_fmode = _O_BINARY;
@@ -119,6 +120,47 @@ static int err_win_to_posix(DWORD winerr)
return error;
}

+static inline wchar_t *utf8_to_wchar(const char *src)
+{
+ /* Allow some extra room for path manipulations
+ */
+ static wchar_t buffer[4][PATH_MAX+4];
+ static int counter = 0;
+ int result;
+
+ if (! src)
+ return NULL;
+
+ if (++counter >= 4)
+ counter = 0;
+
+ result = MultiByteToWideChar(CP_UTF8, 0, src, -1, buffer[counter],
PATH_MAX);
+ if (result >= 0)
+ return buffer[counter];
+ error("Could not convert utf8 '%s' to widechars: %s", src,
+ result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
+ result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
+ "flags/parameter error");
+ return NULL;
+}
+
+static inline int wchar_to_utf8(char *dst, int length, const wchar_t *src)
+{
+ int result;
+
+ if (! src)
+ return 0;
+
+ result = WideCharToMultiByte(CP_UTF8, 0, src, -1, dst, length,
NULL, NULL);
+ if (result >= 0)
+ return result;
+ error("Could not convert widechar '%ls' to utf8: %s", src,
+ result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
+ result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
+ "flags/parameter error");
+ return 0;
+}
+
#undef open
int mingw_open (const char *filename, int oflags, ...)
{
@@ -130,7 +172,9 @@ int mingw_open (const char *filename, int oflags, ...)

if (!strcmp(filename, "/dev/null"))
filename = "nul";
- int fd = open(filename, oflags, mode);
+
+
+ int fd = _wopen(utf8_to_wchar(filename), oflags, mode);
if (fd < 0 && (oflags & O_CREAT) && errno == EACCES) {
DWORD attrs = GetFileAttributes(filename);
if (attrs != INVALID_FILE_ATTRIBUTES && (attrs &
FILE_ATTRIBUTE_DIRECTORY))
@@ -139,6 +183,31 @@ int mingw_open (const char *filename, int oflags, ...)
return fd;
}

+int mingw_access(const char *filename, int mode)
+{
+ return _waccess(utf8_to_wchar(filename), mode);
+}
+
+int mingw_unlink(const char *filename)
+{
+ return _wunlink(utf8_to_wchar(filename));
+}
+
+int mingw_chdir(const char *path)
+{
+ return _wchdir(utf8_to_wchar(path));
+}
+
+int mingw_mkdir(const char *path, int mode)
+{
+ return _wmkdir(utf8_to_wchar(path));
+}
+
+int mingw_rmdir(const char *path)
+{
+ return _wrmdir(utf8_to_wchar(path));
+}
+
static inline time_t filetime_to_time_t(const FILETIME *ft)
{
long long winTime = ((long long)ft->dwHighDateTime << 32) +
ft->dwLowDateTime;
@@ -147,15 +216,70 @@ static inline time_t filetime_to_time_t(const
FILETIME *ft)
return (time_t)winTime;
}

+static int do_readlink(const wchar_t *path, wchar_t *buf, size_t bufsiz)
+{
+ HANDLE handle = CreateFileW(path, GENERIC_READ,
+ FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
+ NULL, OPEN_EXISTING,
+ FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OPEN_REPARSE_POINT,
+ NULL);
+
+ if (handle != INVALID_HANDLE_VALUE) {
+ unsigned char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
+ DWORD dummy = 0;
+ if (DeviceIoControl(handle, FSCTL_GET_REPARSE_POINT, NULL, 0,
buffer,
+ MAXIMUM_REPARSE_DATA_BUFFER_SIZE, &dummy, NULL)) {
+ REPARSE_DATA_BUFFER *b = (REPARSE_DATA_BUFFER *) buffer;
+ if (b->ReparseTag == IO_REPARSE_TAG_SYMLINK) {
+ int len =
b->SymbolicLinkReparseBuffer.SubstituteNameLength / sizeof(wchar_t);
+ int offset =
b->SymbolicLinkReparseBuffer.SubstituteNameOffset / sizeof(wchar_t);
+ if (buf) {
+ len = (bufsiz < len) ? bufsiz : len;
+ wcsncpy(buf, &
b->SymbolicLinkReparseBuffer.PathBuffer[offset], len);
+ buf[len] = 0;
+ }
+ CloseHandle(handle);
+ return len;
+ }
+ }
+
+ CloseHandle(handle);
+ }
+
+ errno = EINVAL;
+ return -1;
+}
+
+
+static inline int get_file_attrw(const wchar_t *fname,
WIN32_FILE_ATTRIBUTE_DATA *fdata)
+{
+ if (GetFileAttributesExW(fname, GetFileExInfoStandard, fdata))
+ return 0;
+
+ switch (GetLastError()) {
+ case ERROR_ACCESS_DENIED:
+ case ERROR_SHARING_VIOLATION:
+ case ERROR_LOCK_VIOLATION:
+ case ERROR_SHARING_BUFFER_EXCEEDED:
+ return EACCES;
+ case ERROR_BUFFER_OVERFLOW:
+ return ENAMETOOLONG;
+ case ERROR_NOT_ENOUGH_MEMORY:
+ return ENOMEM;
+ default:
+ return ENOENT;
+ }
+}
+
/* We keep the do_lstat code in a separate function to avoid recursion.
* When a path ends with a slash, the stat will fail with ENOENT. In
* this case, we strip the trailing slashes and stat again.
*/
-static int do_lstat(const char *file_name, struct stat *buf)
+static int do_lstat(const wchar_t *file_name, struct stat *buf)
{
WIN32_FILE_ATTRIBUTE_DATA fdata;

- if (!(errno = get_file_attr(file_name, &fdata))) {
+ if (!(errno = get_file_attrw(file_name, &fdata))) {
buf->st_ino = 0;
buf->st_gid = 0;
buf->st_uid = 0;
@@ -168,16 +292,15 @@ static int do_lstat(const char *file_name, struct
stat *buf)
buf->st_mtime = filetime_to_time_t(&(fdata.ftLastWriteTime));
buf->st_ctime = filetime_to_time_t(&(fdata.ftCreationTime));
if (fdata.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
- WIN32_FIND_DATAA findbuf;
- HANDLE handle = FindFirstFileA(file_name, &findbuf);
+ WIN32_FIND_DATAW findbuf;
+ HANDLE handle = FindFirstFileW(file_name, &findbuf);
if (handle != INVALID_HANDLE_VALUE) {
if ((findbuf.dwFileAttributes &
FILE_ATTRIBUTE_REPARSE_POINT) &&
(findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK)) {
- char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
buf->st_mode = S_IREAD | S_IFLNK;
if (!(findbuf.dwFileAttributes &
FILE_ATTRIBUTE_READONLY))
buf->st_mode |= S_IWRITE;
- buf->st_size = readlink(file_name, buffer,
MAXIMUM_REPARSE_DATA_BUFFER_SIZE);
+ buf->st_size = do_readlink(file_name, NULL, 0);
}
FindClose(handle);
}
@@ -196,9 +319,10 @@ static int do_lstat(const char *file_name, struct
stat *buf)
int mingw_lstat(const char *file_name, struct stat *buf)
{
int namelen;
- static char alt_name[PATH_MAX];

- if (!do_lstat(file_name, buf))
+ wchar_t *file_name_w = utf8_to_wchar(file_name);
+
+ if (!do_lstat(file_name_w, buf))
return 0;

/* if file_name ended in a '/', Windows returned ENOENT;
@@ -207,17 +331,17 @@ int mingw_lstat(const char *file_name, struct stat
*buf)
if (errno != ENOENT)
return -1;

- namelen = strlen(file_name);
- if (namelen && file_name[namelen-1] != '/')
+ namelen = wcslen(file_name_w);
+
+ if (namelen && file_name_w[namelen-1] != L'/')
return -1;
- while (namelen && file_name[namelen-1] == '/')
+ while (namelen && file_name_w[namelen-1] == L'/')
--namelen;
if (!namelen || namelen >= PATH_MAX)
return -1;

- memcpy(alt_name, file_name, namelen);
- alt_name[namelen] = 0;
- return do_lstat(alt_name, buf);
+ file_name_w[namelen] = 0;
+ return do_lstat(file_name_w, buf);
}

#undef fstat
@@ -265,7 +389,7 @@ int mingw_utime (const char *file_name, const struct
utimbuf *times)
int fh, rc;

/* must have write permission */
- if ((fh = open(file_name, O_RDWR | O_BINARY)) < 0)
+ if ((fh = _wopen(utf8_to_wchar(file_name), O_RDWR | O_BINARY)) < 0)
return -1;

time_t_to_filetime(times->modtime, &mft);
@@ -287,10 +411,11 @@ unsigned int sleep (unsigned int seconds)

int mkstemp(char *template)
{
- char *filename = mktemp(template);
+ wchar_t *filename = _wmktemp(utf8_to_wchar(template));
if (filename == NULL)
return -1;
- return open(filename, O_RDWR | O_CREAT, 0600);
+ wchar_to_utf8(template, strlen(template)+1, filename);
+ return _wopen(filename, O_RDWR | O_CREAT, 0600);
}

int gettimeofday(struct timeval *tv, void *tz)
@@ -435,17 +560,23 @@ struct tm *localtime_r(const time_t *timep, struct
tm *result)
return result;
}

+
+
#undef getcwd
char *mingw_getcwd(char *pointer, int len)
{
int i;
- char *ret = getcwd(pointer, len);
- if (!ret)
- return ret;
- for (i = 0; pointer[i]; i++)
- if (pointer[i] == '\\')
- pointer[i] = '/';
- return ret;
+ wchar_t buffer[PATH_MAX];
+ wchar_t *ret = _wgetcwd(buffer, PATH_MAX);
+ if (!ret || (wcslen(buffer) >= len)) {
+ errno = ERANGE;
+ return (char *) -1;
+ }
+ for (i = 0; buffer[i]; i++)
+ if (buffer[i] == L'\\')
+ buffer[i] = L'/';
+ wchar_to_utf8(pointer, len, buffer);
+ return pointer;
}

#undef getenv
@@ -611,7 +742,7 @@ static void free_path_split(char **path)
*/
static char *lookup_prog(const char *dir, const char *cmd, int isexe,
int exe_only)
{
- char path[MAX_PATH];
+ char path[PATH_MAX];
snprintf(path, sizeof(path), "%s/%s.exe", dir, cmd);

if (!isexe && access(path, F_OK) == 0)
@@ -1130,29 +1261,31 @@ int mingw_rename(const char *pold, const char *pnew)
DWORD attrs, gle;
int tries = 0;
static const int delay[] = { 0, 1, 10, 20, 40 };
+ const wchar_t *poldw = utf8_to_wchar(pold);
+ const wchar_t *pneww = utf8_to_wchar(pnew);

/*
* Try native rename() first to get errno right.
* It is based on MoveFile(), which cannot overwrite existing files.
*/
- if (!rename(pold, pnew))
+ if (!_wrename(poldw, pneww))
return 0;
if (errno != EEXIST)
return -1;
repeat:
- if (MoveFileEx(pold, pnew, MOVEFILE_REPLACE_EXISTING))
+ if (MoveFileExW(poldw, pneww, MOVEFILE_REPLACE_EXISTING))
return 0;
/* TODO: translate more errors */
gle = GetLastError();
if (gle == ERROR_ACCESS_DENIED &&
- (attrs = GetFileAttributes(pnew)) != INVALID_FILE_ATTRIBUTES) {
+ (attrs = GetFileAttributesW(pneww)) != INVALID_FILE_ATTRIBUTES) {
if (attrs & FILE_ATTRIBUTE_DIRECTORY) {
errno = EISDIR;
return -1;
}
if ((attrs & FILE_ATTRIBUTE_READONLY) &&
- SetFileAttributes(pnew, attrs & ~FILE_ATTRIBUTE_READONLY)) {
- if (MoveFileEx(pold, pnew, MOVEFILE_REPLACE_EXISTING))
+ SetFileAttributesW(pneww, attrs & ~FILE_ATTRIBUTE_READONLY)) {
+ if (MoveFileExW(poldw, pneww, MOVEFILE_REPLACE_EXISTING))
return 0;
gle = GetLastError();
/* revert file attributes on failure */
@@ -1331,11 +1464,11 @@ void mingw_open_html(const char *unixpath)

int link(const char *oldpath, const char *newpath)
{
- typedef BOOL WINAPI (*T)(const char*, const char*,
LPSECURITY_ATTRIBUTES);
+ typedef BOOL WINAPI (*T)(const wchar_t*, const wchar_t*,
LPSECURITY_ATTRIBUTES);
static T create_hard_link = NULL;
if (!create_hard_link) {
create_hard_link = (T) GetProcAddress(
- GetModuleHandle("kernel32.dll"), "CreateHardLinkA");
+ GetModuleHandle("kernel32.dll"), "CreateHardLinkW");
if (!create_hard_link)
create_hard_link = (T)-1;
}
@@ -1343,7 +1476,7 @@ int link(const char *oldpath, const char *newpath)
errno = ENOSYS;
return -1;
}
- if (!create_hard_link(newpath, oldpath, NULL)) {
+ if (!create_hard_link(utf8_to_wchar(newpath),
utf8_to_wchar(oldpath), NULL)) {
errno = err_win_to_posix(GetLastError());
return -1;
}
@@ -1352,11 +1485,11 @@ int link(const char *oldpath, const char *newpath)

int symlink(const char *oldpath, const char *newpath)
{
- typedef BOOL WINAPI (*symlink_fn)(const char*, const char*, DWORD);
+ typedef BOOL WINAPI (*symlink_fn)(const wchar_t*, const wchar_t*,
DWORD);
static symlink_fn create_symbolic_link = NULL;
if (!create_symbolic_link) {
create_symbolic_link = (symlink_fn) GetProcAddress(
- GetModuleHandle("kernel32.dll"), "CreateSymbolicLinkA");
+ GetModuleHandle("kernel32.dll"), "CreateSymbolicLinkW");
if (!create_symbolic_link)
create_symbolic_link = (symlink_fn)-1;
}
@@ -1365,7 +1498,7 @@ int symlink(const char *oldpath, const char *newpath)
return -1;
}

- if (!create_symbolic_link(newpath, oldpath, 0)) {
+ if (!create_symbolic_link(utf8_to_wchar(newpath),
utf8_to_wchar(oldpath), 0)) {
errno = err_win_to_posix(GetLastError());
return -1;
}
@@ -1374,32 +1507,10 @@ int symlink(const char *oldpath, const char
*newpath)

int readlink(const char *path, char *buf, size_t bufsiz)
{
- HANDLE handle = CreateFile(path, GENERIC_READ,
- FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
- NULL, OPEN_EXISTING,
- FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OPEN_REPARSE_POINT,
- NULL);
-
- if (handle != INVALID_HANDLE_VALUE) {
- unsigned char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
- DWORD dummy = 0;
- if (DeviceIoControl(handle, FSCTL_GET_REPARSE_POINT, NULL, 0,
buffer,
- MAXIMUM_REPARSE_DATA_BUFFER_SIZE, &dummy, NULL)) {
- REPARSE_DATA_BUFFER *b = (REPARSE_DATA_BUFFER *) buffer;
- if (b->ReparseTag == IO_REPARSE_TAG_SYMLINK) {
- int len =
b->SymbolicLinkReparseBuffer.SubstituteNameLength / sizeof(wchar_t);
- int offset =
b->SymbolicLinkReparseBuffer.SubstituteNameOffset / sizeof(wchar_t);
- snprintf(buf, bufsiz, "%*ls", len, &
b->SymbolicLinkReparseBuffer.PathBuffer[offset]);
- CloseHandle(handle);
- return len;
- }
- }
-
- CloseHandle(handle);
- }
-
- errno = EINVAL;
- return -1;
+ wchar_t buffer[MAX_PATH];
+ int result = do_readlink(utf8_to_wchar(path), buffer, MAX_PATH);
+ wchar_to_utf8(buf, bufsiz, buffer);
+ return result;
}

char *getpass(const char *prompt)
@@ -1417,64 +1528,80 @@ char *getpass(const char *prompt)
return strbuf_detach(&buf, NULL);
}

-#ifndef NO_MINGW_REPLACE_READDIR
-/* MinGW readdir implementation to avoid extra lstats for Git */
+/* MinGW readdir implementation to avoid extra lstats for Git. */
struct mingw_DIR
{
- struct _finddata_t dd_dta; /* disk transfer area for this
dir */
- struct mingw_dirent dd_dir; /* Our own implementation,
including d_type */
- long dd_handle; /* _findnext handle */
- int dd_stat; /* 0 = next entry to read is first
entry, -1 = off the end, positive = 0 based index of next entry */
- char dd_name[1]; /* given path for dir with search
pattern (struct is extended) */
+ HANDLE handle;
+ WIN32_FIND_DATAW findbuf;
+ struct dirent entry;
+ int first;
};

-struct dirent *mingw_readdir(DIR *dir)
+DIR *mingw_opendir(const char *name)
{
- WIN32_FIND_DATAA buf;
- HANDLE handle;
- struct mingw_DIR *mdir = (struct mingw_DIR*)dir;
+ struct mingw_DIR *dir = xmalloc(sizeof(struct mingw_DIR));
+ memset(dir, 0, sizeof(struct mingw_DIR));
+
+ wchar_t *wname = utf8_to_wchar(name);
+ int len = wcslen(wname);
+ if (len > 0) {
+ if (wname[len-1] != L'/')
+ wname[len++] = L'/';
+ wname[len++] = L'*';
+ wname[len] = 0;
+ }
+
+ dir->first = 1;
+ dir->handle = FindFirstFileW(wname, & dir->findbuf);
+ DWORD lasterr = GetLastError();

- if (!dir->dd_handle) {
- errno = EBADF; /* No set_errno for mingw */
+ if (dir->handle == INVALID_HANDLE_VALUE) {
+ errno = err_win_to_posix(lasterr);
+ free(dir);
return NULL;
}
+ return (DIR *) dir;
+}
+
+int mingw_closedir(DIR *indir)
+{
+ struct mingw_DIR *dir = (struct mingw_DIR *) indir;
+ HANDLE handle = dir->handle;
+ free(dir);
+
+ if (! FindClose(handle)) {
+ errno = err_win_to_posix(GetLastError());
+ return -1;
+ }
+ return 0;
+}
+
+struct dirent *mingw_readdir(DIR *indir)
+{
+ struct mingw_DIR *dir = (struct mingw_DIR*)indir;

- if (dir->dd_handle == (long)INVALID_HANDLE_VALUE && dir->dd_stat == 0)
- {
- handle = FindFirstFileA(dir->dd_name, &buf);
- DWORD lasterr = GetLastError();
- dir->dd_handle = (long)handle;
- if (handle == INVALID_HANDLE_VALUE && (lasterr !=
ERROR_NO_MORE_FILES)) {
- errno = err_win_to_posix(lasterr);
+ if (dir->first)
+ dir->first = 0;
+ else {
+ if (! FindNextFileW(dir->handle, & dir->findbuf)) {
+ DWORD lasterr = GetLastError();
+ if (lasterr != ERROR_NO_MORE_FILES)
+ errno = err_win_to_posix(lasterr);
return NULL;
}
- } else if (dir->dd_handle == (long)INVALID_HANDLE_VALUE) {
- return NULL;
- } else if (!FindNextFileA((HANDLE)dir->dd_handle, &buf)) {
- DWORD lasterr = GetLastError();
- FindClose((HANDLE)dir->dd_handle);
- dir->dd_handle = (long)INVALID_HANDLE_VALUE;
- /* POSIX says you shouldn't set errno when readdir can't
- find any more files; so, if another error we leave it set. */
- if (lasterr != ERROR_NO_MORE_FILES)
- errno = err_win_to_posix(lasterr);
- return NULL;
}

- /* We get here if `buf' contains valid data. */
- strcpy(dir->dd_dir.d_name, buf.cFileName);
- ++dir->dd_stat;
+ wchar_to_utf8(dir->entry.d_name, FILENAME_MAX, dir->findbuf.cFileName);

/* Set file type, based on WIN32_FIND_DATA */
- mdir->dd_dir.d_type = 0;
- if ((buf.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) &&
- (buf.dwReserved0 == IO_REPARSE_TAG_SYMLINK))
- mdir->dd_dir.d_type |= DT_LNK;
- else if (buf.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
- mdir->dd_dir.d_type |= DT_DIR;
+ dir->entry.d_type = 0;
+ if ((dir->findbuf.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) &&
+ (dir->findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK))
+ dir->entry.d_type |= DT_LNK;
+ else if (dir->findbuf.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
+ dir->entry.d_type |= DT_DIR;
else
- mdir->dd_dir.d_type |= DT_REG;
+ dir->entry.d_type |= DT_REG;

- return (struct dirent*)&dir->dd_dir;
+ return & dir->entry;
}
-#endif // !NO_MINGW_REPLACE_READDIR
diff --git a/compat/mingw.h b/compat/mingw.h
index 4f9f7f4..7a7977d 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -96,20 +96,6 @@ static inline int fcntl(int fd, int cmd, long arg)
* simple adaptors
*/

-static inline int mingw_mkdir(const char *path, int mode)
-{
- return mkdir(path);
-}
-#define mkdir mingw_mkdir
-
-static inline int mingw_unlink(const char *pathname)
-{
- /* read-only files cannot be removed */
- chmod(pathname, 0666);
- return unlink(pathname);
-}
-#define unlink mingw_unlink
-
static inline int waitpid(pid_t pid, int *status, unsigned options)
{
if (options == 0)
@@ -144,6 +130,22 @@ int readlink(const char *path, char *buf, size_t
bufsiz);
int mingw_open (const char *filename, int oflags, ...);
#define open mingw_open

+int mingw_access(const char *filename, int mode);
+#undef access
+#define access mingw_access
+
+int mingw_chdir(const char *path);
+#define chdir mingw_chdir
+
+int mingw_mkdir(const char *path, int mode);
+#define mkdir mingw_mkdir
+
+int mingw_rmdir(const char *path);
+#define rmdir mingw_rmdir
+
+int mingw_unlink(const char *file_name);
+#define unlink mingw_unlink
+
char *mingw_getcwd(char *pointer, int len);
#define getcwd mingw_getcwd

@@ -249,7 +251,6 @@ int main(int argc, const char **argv) \
} \
static int mingw_main(c,v)

-#ifndef NO_MINGW_REPLACE_READDIR
/*
* A replacement of readdir, to ensure that it reads the file type at
* the same time. This avoid extra unneeded lstats in git on MinGW
@@ -274,6 +275,12 @@ struct mingw_dirent
char d_name[FILENAME_MAX]; /* File name. */
};
#define dirent mingw_dirent
-#define readdir(x) mingw_readdir(x)
+
+DIR *mingw_opendir(const char *name);
+#define opendir mingw_opendir
+
+int mingw_closedir(DIR *dir);
+#define closedir mingw_closedir
+
struct dirent *mingw_readdir(DIR *dir);
-#endif // !NO_MINGW_REPLACE_READDIR
+#define readdir(x) mingw_readdir(x)
--
1.6.4

Johannes Schindelin

unread,

Aug 10, 2009, 7:51:21 PM8/10/09

to Thorvald Natvig, msysGit

Hi,

On Tue, 11 Aug 2009, Thorvald Natvig wrote:

> > Even if you use 32767 as MAX_PATH, I end up with 8*32kB = 256kB. But I
> > actually wanted to stay with 255, which ends up being .25*8kB = 2kB.
> >
> > But the way you're using the buffers, I think you do not even need 4
> > buffers, but only one.
> >
> I know at need at least 2 (for both arguments to link, symlink, rename etc),
> so just to be cautious I've used 4.

Fair enough.

> > The problem is that you violate the DRY principle (Don't Repeat
> > Yourself), and the consequence is all too easy to see: Assuming that
> > the next patch would be the introduction of a config variable to turn
> > off the UTF-8 assumption, you'd need to change _all_ of the calling
> > sites.
> >
> I made it a buffer-rotating static inline functions. As it turns out,
> all cases where you want to convert wchar_t back to utf8 are cases where
> the return buffer is already allocated somewhere, so I only have a
> direct version for that.

Okay.

> Patch attached. Commits, checkouts, removals, even symlinks work.
>
> To get this done, I had to override pretty much every file function that takes
> a char * pointer, as all of those need to deal with utf8->wchar conversion.

Heh...

> It was extra fun when the basedir contained utf8 that couldn't be
> represented in ANSI, as that meant the chdir() to project root failed.

You mean ASCII? That would be rather bad, no?

> I also had to completely replace mingw_readdir, as the _wdiropen in
> mingw doesn't append the '/*' to the dir name in the entry. Since I had
> to override diropen anyway, I simply made a new handler for it.
>
> Note that neither msys' bash or regular cmd can show unicode characters
> for files or directories.

Hmm. We'll have to think of a solution for that.

> The folder above became 'Folder ????'. However, with cmd, you can "cd
> Folder-<TAB>", and it works, whereas bash reports "No such file or
> directory.". I also haven't looked at adding wide-char argument
> interpretation; so you can't "git add <filename_with_unicode>" directly.
>
> This will probably be attached with tab-expansion and flowed format
> again, but I figured it is still easier to comment on it if it is
> inline. Hmm. Actually, due to the number of changes, espeically to the
> readdir() things, the patch makes it look much more gruesome than it is.
> Do you want me to push it as well, so you can see the "finished" result?

I think it is pretty unwise of me to comment on the patch: I should hit
the feathers. Nevertheless, I cannot resist ;-)

> diff --git a/compat/mingw.c b/compat/mingw.c
> index 20f370b..800c782 100644
> --- a/compat/mingw.c
> +++ b/compat/mingw.c

> @@ -119,6 +120,47 @@ static int err_win_to_posix(DWORD winerr)
> return error;
> }
>
> +static inline wchar_t *utf8_to_wchar(const char *src)

As it uses a static buffer, I doubt that it will get inlined, but I might
be wrong.

> +{
> + /* Allow some extra room for path manipulations
> + */

Please just append the */ if the comment fits in one line; if it is a
multi-line comment, I'd like to have it in the form

/*
* The comment starts on the second line, and every
* line begins with a star that is aligned to the first
* one. The comment ends with a star and a slash on a
* single line.
*/

I note that there are violations of that coding style in compat/mingw.c,
but let's not make things worse.

> + static wchar_t buffer[4][PATH_MAX+4];

Is it really +4? I thought +1 is enough? (I realize I said
2*(PATH_MAX+1) earlier, but of course that's wrong.)

> + static int counter = 0;
> + int result;
> +
> + if (! src)

Please keep only the space after the "if", but not the one after the
exclamation mark.

> + return NULL;
> +
> + if (++counter >= 4)

How about ARRAY_SIZE(buf) instead? Then you do not need to take extra
care when you decide that 4 is not enough after all.

> + counter = 0;
> +
> + result = MultiByteToWideChar(CP_UTF8, 0, src, -1, buffer[counter], PATH_MAX);

This line probably wants to be wrapped.

> + if (result >= 0)
> + return buffer[counter];
> + error("Could not convert utf8 '%s' to widechars: %s", src,
> + result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
> + result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
> + "flags/parameter error");
> + return NULL;
> +}
> +
> +static inline int wchar_to_utf8(char *dst, int length, const wchar_t *src)

Maybe we need some distinction in the name which tells the reader that
this converts into the buffer "dst" instead of returning a pointer? But
then, maybe it would be better to return dst in case of success, and NULL
otherwise?

> +{
> + int result;
> +
> + if (! src)

Again, please remove the space between "!" and "src".

> + return 0;
> +
> + result = WideCharToMultiByte(CP_UTF8, 0, src, -1, dst, length, NULL, NULL);

This line was wrapped around by your mailer; I think it is safe to assume
that it is slightly too long and wants to be wrapped.

> + if (result >= 0)
> + return result;
> + error("Could not convert widechar '%ls' to utf8: %s", src,
> + result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
> + result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
> + "flags/parameter error");
> + return 0;
> +}
> +
> #undef open
> int mingw_open (const char *filename, int oflags, ...)
> {
> @@ -130,7 +172,9 @@ int mingw_open (const char *filename, int oflags, ...)
>
> if (!strcmp(filename, "/dev/null"))
> filename = "nul";
> - int fd = open(filename, oflags, mode);
> +
> +
> + int fd = _wopen(utf8_to_wchar(filename), oflags, mode);

I think we can keep the style from before, not inserting empty lines.

Another suggestion: this modifies existing code, while the function
utf8_to_wchar() is new. Maybe you want to split the patch into
introduction of the new functions and modifying the existing code to make
use of it in a second one?

This is actually a modification of the patch in work/symlink, correct?
Maybe I should apply that to 'devel' then, so that you can modify just the
parts needed for work/utf8?

> +static inline int get_file_attrw(const wchar_t *fname, WIN32_FILE_ATTRIBUTE_DATA *fdata)
> +{
> + if (GetFileAttributesExW(fname, GetFileExInfoStandard, fdata))
> + return 0;
> +
> + switch (GetLastError()) {
> + case ERROR_ACCESS_DENIED:
> + case ERROR_SHARING_VIOLATION:
> + case ERROR_LOCK_VIOLATION:
> + case ERROR_SHARING_BUFFER_EXCEEDED:
> + return EACCES;
> + case ERROR_BUFFER_OVERFLOW:
> + return ENAMETOOLONG;
> + case ERROR_NOT_ENOUGH_MEMORY:
> + return ENOMEM;
> + default:
> + return ENOENT;
> + }
> +}

Could you not use err_win_to_posix() instead?

> @@ -168,16 +292,15 @@ static int do_lstat(const char *file_name, struct stat
> *buf)
> buf->st_mtime = filetime_to_time_t(&(fdata.ftLastWriteTime));
> buf->st_ctime = filetime_to_time_t(&(fdata.ftCreationTime));
> if (fdata.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
> - WIN32_FIND_DATAA findbuf;
> - HANDLE handle = FindFirstFileA(file_name, &findbuf);
> + WIN32_FIND_DATAW findbuf;
> + HANDLE handle = FindFirstFileW(file_name, &findbuf);
> if (handle != INVALID_HANDLE_VALUE) {
> if ((findbuf.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT)
> &&
> (findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK)) {
> - char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];

Oh? It is no longer necessary? Maybe this is a change that merits its
own patch.

> @@ -435,17 +560,23 @@ struct tm *localtime_r(const time_t *timep, struct tm
> *result)
> return result;
> }
>
> +
> +
> #undef getcwd
> char *mingw_getcwd(char *pointer, int len)

These empty lines are not necessary.

> {
> int i;
> - char *ret = getcwd(pointer, len);
> - if (!ret)
> - return ret;
> - for (i = 0; pointer[i]; i++)
> - if (pointer[i] == '\\')
> - pointer[i] = '/';
> - return ret;
> + wchar_t buffer[PATH_MAX];
> + wchar_t *ret = _wgetcwd(buffer, PATH_MAX);
> + if (!ret || (wcslen(buffer) >= len)) {
> + errno = ERANGE;
> + return (char *) -1;

I think the convention is to return NULL upon error.

> + }
> + for (i = 0; buffer[i]; i++)
> + if (buffer[i] == L'\\')
> + buffer[i] = L'/';
> + wchar_to_utf8(pointer, len, buffer);
> + return pointer;
> }
>
> #undef getenv
> @@ -611,7 +742,7 @@ static void free_path_split(char **path)
> */
> static char *lookup_prog(const char *dir, const char *cmd, int isexe, int
> exe_only)
> {
> - char path[MAX_PATH];
> + char path[PATH_MAX];

This is a good change, but would be better in a separate (trivial) patch.

> @@ -1130,29 +1261,31 @@ int mingw_rename(const char *pold, const char *pnew)
> DWORD attrs, gle;
> int tries = 0;
> static const int delay[] = { 0, 1, 10, 20, 40 };
> + const wchar_t *poldw = utf8_to_wchar(pold);
> + const wchar_t *pneww = utf8_to_wchar(pnew);
>
> /*
> * Try native rename() first to get errno right.
> * It is based on MoveFile(), which cannot overwrite existing files.
> */
> - if (!rename(pold, pnew))
> + if (!_wrename(poldw, pneww))
> return 0;
> if (errno != EEXIST)
> return -1;
> repeat:
> - if (MoveFileEx(pold, pnew, MOVEFILE_REPLACE_EXISTING))
> + if (MoveFileExW(poldw, pneww, MOVEFILE_REPLACE_EXISTING))
> return 0;
> /* TODO: translate more errors */

Hah, I think this also would benefit from err_win_to_posix()! Also in a
separate patch, of course.

> gle = GetLastError();
> if (gle == ERROR_ACCESS_DENIED &&
> - (attrs = GetFileAttributes(pnew)) != INVALID_FILE_ATTRIBUTES) {
> + (attrs = GetFileAttributesW(pneww)) != INVALID_FILE_ATTRIBUTES) {
> if (attrs & FILE_ATTRIBUTE_DIRECTORY) {
> errno = EISDIR;
> return -1;
> }

> @@ -1417,64 +1528,80 @@ char *getpass(const char *prompt)
> return strbuf_detach(&buf, NULL);
> }
>
> -#ifndef NO_MINGW_REPLACE_READDIR

Hmm. I think it is not wise to break this mode. If the builder asked for
no replacement, that wish should be granted.

However, I realize that this is not possible if we're trying to interpret
everything as UTF-8.

OTOH I would really like to support the decisions of existing projects to
stay with Windows encoding. For Windows-only projects, there is no need
to step away from peculiar encodings that do not play well with other
platforms.

IOW I would like the patch to be guarded by a config option.

> diff --git a/compat/mingw.h b/compat/mingw.h
> index 4f9f7f4..7a7977d 100644
> --- a/compat/mingw.h
> +++ b/compat/mingw.h
> @@ -96,20 +96,6 @@ static inline int fcntl(int fd, int cmd, long arg)
> * simple adaptors
> */
>
> -static inline int mingw_mkdir(const char *path, int mode)
> -{
> - return mkdir(path);
> -}
> -#define mkdir mingw_mkdir
> -
> -static inline int mingw_unlink(const char *pathname)
> -{
> - /* read-only files cannot be removed */
> - chmod(pathname, 0666);
> - return unlink(pathname);
> -}
> -#define unlink mingw_unlink
> -
> static inline int waitpid(pid_t pid, int *status, unsigned options)
> {
> if (options == 0)

Maybe it would be a good idea to move the implementations to mingw.c
first, and then change them in a subsequent patch, to make it more obvious
what happened...

What about rename() and fopen()?

I realize that you worked really hard, and I am impressed by your
progress.

Looking forward (after sleeping!) to further developments,
Dscho

Thorvald Natvig

unread,

Aug 10, 2009, 9:58:28 PM8/10/09

to Johannes Schindelin, msysGit

>> It was extra fun when the basedir contained utf8 that couldn't be
>> represented in ANSI, as that meant the chdir() to project root failed.
>>
>
> You mean ASCII? That would be rather bad, no?
>

No, I mean the default CP of Windows, which once upon a time was the
ANSI one ;) Anyway, the point was that getcwd(), if you're in a unicode
path, returns garbage. In other words; chdir(getcwd()) will fail, which
took a bit of time to figure out :)

>> I also had to completely replace mingw_readdir, as the _wdiropen in
>> mingw doesn't append the '/*' to the dir name in the entry. Since I had
>> to override diropen anyway, I simply made a new handler for it.
>>
>> Note that neither msys' bash or regular cmd can show unicode characters
>> for files or directories.
>>
> Hmm. We'll have to think of a solution for that.
>

The solution is probably to patch msys, so it also knows about
unicode/utf conversion.. And symlinks as well, I guess ;) Anyway, that's
not an itch I've got right now -- I can live perfectly fine without that.

> I think it is pretty unwise of me to comment on the patch: I should hit
> the feathers. Nevertheless, I cannot resist ;-)
>

I know that feeling. I'm about to go sleep myself, but first...

>> diff --git a/compat/mingw.c b/compat/mingw.c
>> index 20f370b..800c782 100644
>> --- a/compat/mingw.c
>> +++ b/compat/mingw.c
>> @@ -119,6 +120,47 @@ static int err_win_to_posix(DWORD winerr)
>> return error;
>> }
>>
>> +static inline wchar_t *utf8_to_wchar(const char *src)
>>
> As it uses a static buffer, I doubt that it will get inlined, but I might
> be wrong.
>

It probably doesn't, but it might in the future, and it doesn't really
hurt, does it?

>> +{
>> + /* Allow some extra room for path manipulations
>> + */
>>
>
> Please just append the */ if the comment fits in one line; if it is a
> multi-line comment, I'd like to have it in the form
>
> /*
> * The comment starts on the second line, and every
> * line begins with a star that is aligned to the first
> * one. The comment ends with a star and a slash on a
> * single line.
> */
>
> I note that there are violations of that coding style in compat/mingw.c,
> but let's not make things worse.
>

Sorry about that, and the extra lines. I didn't think this would be the
final version of the patch, so I haven't gone over it with a
fine-toothed comb yet :)

>> + static wchar_t buffer[4][PATH_MAX+4];
>>
>
> Is it really +4? I thought +1 is enough? (I realize I said
> 2*(PATH_MAX+1) earlier, but of course that's wrong.)
>

In some places I append stuff to the returned buffer. Specifically, for
opendir(), I need to append "\*" to the string. So I know I need 3
bytes, and hence add 4 to be safe.

>
>> +static inline int wchar_to_utf8(char *dst, int length, const wchar_t *src)
>>
>
> Maybe we need some distinction in the name which tells the reader that
> this converts into the buffer "dst" instead of returning a pointer? But
> then, maybe it would be better to return dst in case of success, and NULL
> otherwise?
>

I sort of based it on the idea of strcpy and friends. And at one point I
needed the length, but I never got to it (see below).
Changed to return char * for now.. Actually, if we can make it clear it
does return the length(), that is more usefull for me. Suggestion for
function name? strnuwcpy?

> I think we can keep the style from before, not inserting empty lines.
>
> Another suggestion: this modifies existing code, while the function
> utf8_to_wchar() is new. Maybe you want to split the patch into
> introduction of the new functions and modifying the existing code to make
> use of it in a second one?
>

I don't know. All of these sort of belong together; you don't need
utf8_to_wchar if you don't need it anywhere, so a patch to add it
without using it seems "odd" to me :) Or do you mean split the existing
patch into a series of patches, with individual commit messages?

>> +static int do_readlink(const wchar_t *path, wchar_t *buf, size_t bufsiz)
>> +{
>>

(snip snip)

> This is actually a modification of the patch in work/symlink, correct?
> Maybe I should apply that to 'devel' then, so that you can modify just the
> parts needed for work/utf8?
>

Well, since I call 'readlink' functionality both from the actual
readlink and from lstat() (to get 'size' of symlinks), I made it a
'internal' function that worked on wchar_t.

>> +static inline int get_file_attrw(const wchar_t *fname, WIN32_FILE_ATTRIBUTE_DATA *fdata)
>> +{
>> + if (GetFileAttributesExW(fname, GetFileExInfoStandard, fdata))
>> + return 0;
>> +
>> + switch (GetLastError()) {
>> + case ERROR_ACCESS_DENIED:
>> + case ERROR_SHARING_VIOLATION:
>> + case ERROR_LOCK_VIOLATION:
>> + case ERROR_SHARING_BUFFER_EXCEEDED:
>> + return EACCES;
>> + case ERROR_BUFFER_OVERFLOW:
>> + return ENAMETOOLONG;
>> + case ERROR_NOT_ENOUGH_MEMORY:
>> + return ENOMEM;
>> + default:
>> + return ENOENT;
>> + }
>> +}
>>
>
> Could you not use err_win_to_posix() instead?
>

I can. This is a clone of get_file_attr() from win32.h. We only uses it
in lstat() though, so I can just make it call GetFileAttributes directly
and use err_win_to_posix().

>> @@ -168,16 +292,15 @@ static int do_lstat(const char *file_name, struct stat
>> *buf)
>> buf->st_mtime = filetime_to_time_t(&(fdata.ftLastWriteTime));
>> buf->st_ctime = filetime_to_time_t(&(fdata.ftCreationTime));
>> if (fdata.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
>> - WIN32_FIND_DATAA findbuf;
>> - HANDLE handle = FindFirstFileA(file_name, &findbuf);
>> + WIN32_FIND_DATAW findbuf;
>> + HANDLE handle = FindFirstFileW(file_name, &findbuf);
>> if (handle != INVALID_HANDLE_VALUE) {
>> if ((findbuf.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT)
>> &&
>> (findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK)) {
>> - char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
>>
>
> Oh? It is no longer necessary? Maybe this is a change that merits its
> own patch.
>

When I made "readlink" internal above, I made it specialcase the case
where you don't pass the output buffer : only return the length of the
link (which is what we need in lstat, we don't care what it points to).

>
>> {
>> int i;
>> - char *ret = getcwd(pointer, len);
>> - if (!ret)
>> - return ret;
>> - for (i = 0; pointer[i]; i++)
>> - if (pointer[i] == '\\')
>> - pointer[i] = '/';
>> - return ret;
>> + wchar_t buffer[PATH_MAX];
>> + wchar_t *ret = _wgetcwd(buffer, PATH_MAX);
>> + if (!ret || (wcslen(buffer) >= len)) {
>> + errno = ERANGE;
>> + return (char *) -1;
>>
>
> I think the convention is to return NULL upon error.
>

Uhm. Just ignore that, I'm being silly. _wgetcwd() returns a wchar_t
pointer, whereas 'getcwd' on *nix is supposed to return the length of
the returned data.
Oh, and I think this was the case I wrote the 'return length from utf8
conversion' idea for.

>> - char path[MAX_PATH];
>> + char path[PATH_MAX];
>>
>
> This is a good change, but would be better in a separate (trivial) patch.
>

Oops. I mistakenly used MAX_PATH (the Win32 constant) initially, then
saw you used PATH_MAX, and did a search/replace on the document. So this
change wasn't intentional :)

>> gle = GetLastError();
>> if (gle == ERROR_ACCESS_DENIED &&
>> - (attrs = GetFileAttributes(pnew)) != INVALID_FILE_ATTRIBUTES) {
>> + (attrs = GetFileAttributesW(pneww)) != INVALID_FILE_ATTRIBUTES) {
>> if (attrs & FILE_ATTRIBUTE_DIRECTORY) {
>> errno = EISDIR;
>> return -1;
>> }
>> @@ -1417,64 +1528,80 @@ char *getpass(const char *prompt)
>> return strbuf_detach(&buf, NULL);
>> }
>>
>> -#ifndef NO_MINGW_REPLACE_READDIR
>>
>
> Hmm. I think it is not wise to break this mode. If the builder asked for
> no replacement, that wish should be granted.
>
> However, I realize that this is not possible if we're trying to interpret
> everything as UTF-8.
>
> OTOH I would really like to support the decisions of existing projects to
> stay with Windows encoding. For Windows-only projects, there is no need
> to step away from peculiar encodings that do not play well with other
> platforms.
>
> IOW I would like the patch to be guarded by a config option.
>

What do you mean by "config option" in this case? If you mean
compile-time option, with #ifdefs, that's a LOT of code that has to be
written twice, with obvious implications for maintainability. Take
'link', for example, which has to look up different symbols for 'A' and
'W' mode. I think it would just introduce a lot of pain :(

If you mean a runtime 'git --config' option, I haven't looked at adding
config options, but if that is doable, we can simply modify the to/from
unicode functions to use the GetACP() codepage instead of UTF8 if the
config option "win32.noutf8" is true? Hm. Or we could use
"i18n.commitEncoding" for a new and interesting purpose. Anyway, this is
doable; we'd keep using wchar_t functions, we'd just change what
codepage we translate to/from.

> Maybe it would be a good idea to move the implementations to mingw.c
> first, and then change them in a subsequent patch, to make it more obvious
> what happened...
>

We could, but do I really have to create a patch with code that will be
immediately replaced? It seems odd to me to have to create a

#undef unlink
int mingw_unlink(const char *path)
{
return unlink(path);
}

function.. then immediately replace it with something with "meat on the
bones" in the very next patch. But if that's the way you prefer it, I'll
see what can be done. I'd prefer if we can agree on what the end result
should be first, though :)

> What about rename() and fopen()?
>

rename() was already overridden, it's the function that says /* TODO:
translate more errors */ ;)

I'd forgotten fopen(). Things still worked fine without it, but that
could just be because I didn't test it enough. I'll get it added.

Best Regards,
Thorvald

Johannes Schindelin

unread,

Aug 11, 2009, 4:29:29 AM8/11/09

to Thorvald Natvig, msysGit

Hi,

On Tue, 11 Aug 2009, Thorvald Natvig wrote:

> > > It was extra fun when the basedir contained utf8 that couldn't be
> > > represented in ANSI, as that meant the chdir() to project root
> > > failed.
> > >
> >
> > You mean ASCII? That would be rather bad, no?
>
> No, I mean the default CP of Windows, which once upon a time was the
> ANSI one ;) Anyway, the point was that getcwd(), if you're in a unicode
> path, returns garbage. In other words; chdir(getcwd()) will fail, which
> took a bit of time to figure out :)

That is terrible. Just to make sure I understand correctly: when in a
UTF-8 directory, the _plain_ Windows chdir() and getcwd() (i.e. _not_ the
compat/mingw.c versions, if any) misbehave?

If so: how? Does getcwd() return the UTF-8 path, rather than using the
current codepage, or does chdir() expect something different?

> > > I also had to completely replace mingw_readdir, as the _wdiropen in
> > > mingw doesn't append the '/*' to the dir name in the entry. Since I
> > > had to override diropen anyway, I simply made a new handler for it.
> > >
> > > Note that neither msys' bash or regular cmd can show unicode
> > > characters for files or directories.
> > >
> > Hmm. We'll have to think of a solution for that.
>
> The solution is probably to patch msys, so it also knows about
> unicode/utf conversion.. And symlinks as well, I guess ;) Anyway, that's
> not an itch I've got right now -- I can live perfectly fine without
> that.

Well, I guess it will become my itch.

But as we already have /src/rt/ in the 'msys' branch, and since you
already laid out what needs to be done, I think it should not be terribly
difficult.

> > I think it is pretty unwise of me to comment on the patch: I should
> > hit the feathers. Nevertheless, I cannot resist ;-)
>
> I know that feeling. I'm about to go sleep myself,

:-)

> but first...
>
> > > diff --git a/compat/mingw.c b/compat/mingw.c
> > > index 20f370b..800c782 100644
> > > --- a/compat/mingw.c
> > > +++ b/compat/mingw.c
> > > @@ -119,6 +120,47 @@ static int err_win_to_posix(DWORD winerr)
> > > return error;
> > > }
> > >
> > > +static inline wchar_t *utf8_to_wchar(const char *src)
> > >
> > As it uses a static buffer, I doubt that it will get inlined, but I
> > might be wrong.
>
> It probably doesn't, but it might in the future, and it doesn't really
> hurt, does it?

It does not hurt, indeed. I just wanted to prove that I am still able to
think even if I'm dead tired.

> > > +{
> > > + /* Allow some extra room for path manipulations
> > > + */
> > >
> >
> > Please just append the */ if the comment fits in one line; if it is a
> > multi-line comment, I'd like to have it in the form
> >
> > /*
> > * The comment starts on the second line, and every
> > * line begins with a star that is aligned to the first
> > * one. The comment ends with a star and a slash on a
> > * single line.
> > */
> >
> > I note that there are violations of that coding style in
> > compat/mingw.c, but let's not make things worse.
>
> Sorry about that, and the extra lines. I didn't think this would be the
> final version of the patch, so I haven't gone over it with a
> fine-toothed comb yet
> :)

Hey, no problem. I was just offering comments -- this is what Open Source
is about, right?

> > > + static wchar_t buffer[4][PATH_MAX+4];
> > >
> >
> > Is it really +4? I thought +1 is enough? (I realize I said
> > 2*(PATH_MAX+1) earlier, but of course that's wrong.)
>
> In some places I append stuff to the returned buffer. Specifically, for
> opendir(), I need to append "\*" to the string. So I know I need 3
> bytes, and hence add 4 to be safe.

s/bytes/characters/

Maybe a comment would be good, then, so that no overzealous janitor breaks
the code.

> > > +static inline int wchar_to_utf8(char *dst, int length, const wchar_t
> > > *src)
> > >
> >
> > Maybe we need some distinction in the name which tells the reader that this
> > converts into the buffer "dst" instead of returning a pointer? But then,
> > maybe it would be better to return dst in case of success, and NULL
> > otherwise?
>
> I sort of based it on the idea of strcpy and friends. And at one point I
> needed the length, but I never got to it (see below).
>
> Changed to return char * for now.. Actually, if we can make it clear it
> does return the length(), that is more usefull for me. Suggestion for
> function name? strnuwcpy?

I just remembered (sorry, I may have been indeed too tired last night)
that there is a function which does exactly what you want: wcstombs().
Its signature looks like this:

size_t wcstombs(char *dest, const wchar_t *src, size_t n);

It's part of the C99 standard, according to my documentation.

> > I think we can keep the style from before, not inserting empty lines.
> >
> > Another suggestion: this modifies existing code, while the function
> > utf8_to_wchar() is new. Maybe you want to split the patch into
> > introduction of the new functions and modifying the existing code to
> > make use of it in a second one?
>
> I don't know. All of these sort of belong together; you don't need
> utf8_to_wchar if you don't need it anywhere, so a patch to add it
> without using it seems "odd" to me :) Or do you mean split the existing
> patch into a series of patches, with individual commit messages?

Hmm, you're probably right.

> > > +static int do_readlink(const wchar_t *path, wchar_t *buf, size_t bufsiz)
> > > +{
> > >
> (snip snip)
>
> > This is actually a modification of the patch in work/symlink, correct?
> > Maybe I should apply that to 'devel' then, so that you can modify just
> > the parts needed for work/utf8?
>
> Well, since I call 'readlink' functionality both from the actual
> readlink and from lstat() (to get 'size' of symlinks), I made it a
> 'internal' function that worked on wchar_t.

Yes. The patch _may_ be easier to read if you made that split an own
patch, but that's your decision, not mine.

> > > +static inline int get_file_attrw(const wchar_t *fname,
> > > WIN32_FILE_ATTRIBUTE_DATA *fdata)
> > > +{
> > > + if (GetFileAttributesExW(fname, GetFileExInfoStandard, fdata))
> > > + return 0;
> > > +
> > > + switch (GetLastError()) {
> > > + case ERROR_ACCESS_DENIED:
> > > + case ERROR_SHARING_VIOLATION:
> > > + case ERROR_LOCK_VIOLATION:
> > > + case ERROR_SHARING_BUFFER_EXCEEDED:
> > > + return EACCES;
> > > + case ERROR_BUFFER_OVERFLOW:
> > > + return ENAMETOOLONG;
> > > + case ERROR_NOT_ENOUGH_MEMORY:
> > > + return ENOMEM;
> > > + default:
> > > + return ENOENT;
> > > + }
> > > +}
> > >
> >
> > Could you not use err_win_to_posix() instead?
>
> I can. This is a clone of get_file_attr() from win32.h. We only uses it
> in lstat() though, so I can just make it call GetFileAttributes directly
> and use err_win_to_posix().

Would be shorter, and less error-prone, right?

> > > @@ -168,16 +292,15 @@ static int do_lstat(const char *file_name, struct
> > > stat
> > > *buf)
> > > buf->st_mtime = filetime_to_time_t(&(fdata.ftLastWriteTime));
> > > buf->st_ctime = filetime_to_time_t(&(fdata.ftCreationTime));
> > > if (fdata.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
> > > - WIN32_FIND_DATAA findbuf;
> > > - HANDLE handle = FindFirstFileA(file_name, &findbuf);
> > > + WIN32_FIND_DATAW findbuf;
> > > + HANDLE handle = FindFirstFileW(file_name, &findbuf);
> > > if (handle != INVALID_HANDLE_VALUE) {
> > > if ((findbuf.dwFileAttributes &
> > > FILE_ATTRIBUTE_REPARSE_POINT)
> > > &&
> > > (findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK)) {
> > > - char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
> > >
> >
> > Oh? It is no longer necessary? Maybe this is a change that merits
> > its own patch.
>
> When I made "readlink" internal above, I made it specialcase the case
> where you don't pass the output buffer : only return the length of the
> link (which is what we need in lstat, we don't care what it points to).

Okay...

> > > {
> > > int i;
> > > - char *ret = getcwd(pointer, len);
> > > - if (!ret)
> > > - return ret;
> > > - for (i = 0; pointer[i]; i++)
> > > - if (pointer[i] == '\\')
> > > - pointer[i] = '/';
> > > - return ret;
> > > + wchar_t buffer[PATH_MAX];
> > > + wchar_t *ret = _wgetcwd(buffer, PATH_MAX);
> > > + if (!ret || (wcslen(buffer) >= len)) {
> > > + errno = ERANGE;
> > > + return (char *) -1;
> > >
> >
> > I think the convention is to return NULL upon error.
>
> Uhm. Just ignore that, I'm being silly. _wgetcwd() returns a wchar_t
> pointer, whereas 'getcwd' on *nix is supposed to return the length of
> the returned data.
>
> Oh, and I think this was the case I wrote the 'return length from utf8
> conversion' idea for.

Okay.

> > > - char path[MAX_PATH];
> > > + char path[PATH_MAX];
> > >
> >
> > This is a good change, but would be better in a separate (trivial)
> > patch.
>
> Oops. I mistakenly used MAX_PATH (the Win32 constant) initially, then
> saw you used PATH_MAX, and did a search/replace on the document. So this
> change wasn't intentional :)

Heh. I think we have a commit in git.git that unifies all these
disambiguities by replacing MAX_PATH with PATH_MAX.

That sounds sensible.

For an example what I meant by "config option", please see:

http://repo.or.cz/w/git/mingw/4msysgit.git?a=commitdiff;h=dc0019461c584bfed81485bb7dc6854d282f28b6

> > Maybe it would be a good idea to move the implementations to mingw.c
> > first, and then change them in a subsequent patch, to make it more
> > obvious what happened...
>
> We could, but do I really have to create a patch with code that will be
> immediately replaced? It seems odd to me to have to create a
>
> #undef unlink
> int mingw_unlink(const char *path)
> {
> return unlink(path);
> }
>
> function.. then immediately replace it with something with "meat on the
> bones" in the very next patch. But if that's the way you prefer it, I'll
> see what can be done. I'd prefer if we can agree on what the end result
> should be first, though :)

Well, like I said before: I just offer comments, it is your decision,
though.

> > What about rename() and fopen()?
>
> rename() was already overridden, it's the function that says /* TODO:
> translate more errors */ ;)
>
> I'd forgotten fopen(). Things still worked fine without it, but that could
> just be because I didn't test it enough. I'll get it added.

Thanks for the clarification!

Oh, and thanks for your wonderful work!

Ciao,
Dscho

Thorvald Natvig

unread,

Aug 11, 2009, 10:02:58 AM8/11/09

to Johannes Schindelin, msysGit

> That is terrible. Just to make sure I understand correctly: when in a
> UTF-8 directory, the _plain_ Windows chdir() and getcwd() (i.e. _not_ the
> compat/mingw.c versions, if any) misbehave?
>
> If so: how? Does getcwd() return the UTF-8 path, rather than using the
> current codepage, or does chdir() expect something different?
>

If you're in a directory which has unicode characters that can't be
represented in the current codepage, the plain windows getcwd() still
tries to represent every unicode symbol with one character, which
becomes nonsense. And since it's nonsense, you can't chdir() to it.
But as I had to override mkdir, rmdir etc anyway (since directory-names
in the repository might have utf8), adding getcwd wasn't much work.

> Well, I guess it will become my itch.
>

This only applies if you use characters outside your "local codepage".
So your ß and Ü is representable; they do show up ;) This only becomes a
problem when you use, e.g. japanese or chinese letters (which I what I
tested with -- wikipedia has an abundance of examples).

>> Sorry about that, and the extra lines. I didn't think this would be the
>> final version of the patch, so I haven't gone over it with a
>> fine-toothed comb yet
>> :)
>>
>
> Hey, no problem. I was just offering comments -- this is what Open Source
> is about, right?
>

Indeed :)

>> In some places I append stuff to the returned buffer. Specifically, for
>> opendir(), I need to append "\*" to the string. So I know I need 3
>> bytes, and hence add 4 to be safe.
>>
>
> s/bytes/characters/
>
> Maybe a comment would be good, then, so that no overzealous janitor breaks
> the code.
>

Good point.

> I just remembered (sorry, I may have been indeed too tired last night)
> that there is a function which does exactly what you want: wcstombs().
> Its signature looks like this:
>
> size_t wcstombs(char *dest, const wchar_t *src, size_t n);
>
> It's part of the C99 standard, according to my documentation.
>

And it's also in the Windows CRT, it seems. But it doesn't do UTF8 :(
I made a mingw_wcstombs and redirected with a #define.

> Yes. The patch _may_ be easier to read if you made that split an own
> patch, but that's your decision, not mine.
>

As long as whoever will commit it can read it, I care mostly about the
end result, to be honest :)

>>> Could you not use err_win_to_posix() instead?
>>>
>> I can. This is a clone of get_file_attr() from win32.h. We only uses it
>> in lstat() though, so I can just make it call GetFileAttributes directly
>> and use err_win_to_posix().
>>
> Would be shorter, and less error-prone, right?
>

Indeed. But the old function used get_file_attr() in that one location,
so I thought I'd keep the diff down ;) ... As it turns out, it will
actually be fewer lines, so I'll make that change.

>> When I made "readlink" internal above, I made it specialcase the case
>> where you don't pass the output buffer : only return the length of the
>> link (which is what we need in lstat, we don't care what it points to).
>>
>
> Okay...
>

And I reverted it again, because naturally, when you call lstat() and
readlink(), you want them to return the number of utf8 characters, not
the number of unicode characters, so you need the actual buffer so you
can utf8 convert it.
My bad, fixed :)

>> Uhm. Just ignore that, I'm being silly. _wgetcwd() returns a wchar_t
>> pointer, whereas 'getcwd' on *nix is supposed to return the length of
>> the returned data.
>>
>> Oh, and I think this was the case I wrote the 'return length from utf8
>> conversion' idea for.
>>
> Okay.
>

That's what I get for reading manpages at 2 AM.
man 2 getcwd is the Linux kernel version, and does return a long.
man 3 getcwd is the POSIX one, and does return a char * ;)

>>> Hmm. I think it is not wise to break this mode. If the builder asked
>>> for no replacement, that wish should be granted.
>>>
>>> However, I realize that this is not possible if we're trying to
>>> interpret everything as UTF-8.
>>>
>>> OTOH I would really like to support the decisions of existing projects
>>> to stay with Windows encoding. For Windows-only projects, there is no
>>> need to step away from peculiar encodings that do not play well with
>>> other platforms.
>>>
>>> IOW I would like the patch to be guarded by a config option.
>>>
>> What do you mean by "config option" in this case? If you mean
>> compile-time option, with #ifdefs, that's a LOT of code that has to be
>> written twice, with obvious implications for maintainability. Take
>> 'link', for example, which has to look up different symbols for 'A' and
>> 'W' mode. I think it would just introduce a lot of pain :(
>>
>> If you mean a runtime 'git --config' option, I haven't looked at adding
>> config options, but if that is doable, we can simply modify the to/from
>> unicode functions to use the GetACP() codepage instead of UTF8 if the
>> config option "win32.noutf8" is true? Hm. Or we could use
>> "i18n.commitEncoding" for a new and interesting purpose. Anyway, this is
>> doable; we'd keep using wchar_t functions, we'd just change what
>> codepage we translate to/from.
>>
>
> That sounds sensible.
>
> For an example what I meant by "config option", please see:
>
> http://repo.or.cz/w/git/mingw/4msysgit.git?a=commitdiff;h=dc0019461c584bfed81485bb7dc6854d282f28b6
>

Ah, wonderful example :)
Would you accept that as a 2nd commit, done on top of this one once if
it is accepted? I have.. bad experience maintaining a patch series, I
always end up invalidating the later patches, so I'd appreciate if we
could do this in two steps ;)

> Oh, and thanks for your wonderful work!
>

You're very welcome. This is a great project :)

Updated patch attached. This does one further change; it uses
FindFirstFileW for lstat() instead of GetFileAttributesW. FindFirstFile
is actually marginally faster (measured by using 'time git status' on
the full msysgit repo, and google indicates others have found the same
result), and it makes the lstat() code to check for symlinks much cleaner.
Most of the patch reads cleanly, except the rewrite of readdir, where I
have hard time seeing the similarity between the diff and my code, so
I've cut&paste'd a full copy of that function at the very end.

Looking forward to more comments :)

From 2241371fefd823e34bcf2f37bd140bb9ba7a0a72 Mon Sep 17 00:00:00 2001

From: Thorvald Natvig <sli...@users.sourceforge.net>
Date: Mon, 10 Aug 2009 23:44:10 +0200
Subject: [PATCH 2/2] UTF8 Support

---
compat/mingw.c | 389
+++++++++++++++++++++++++++++++++++++-------------------
compat/mingw.h | 46 ++++---
2 files changed, 288 insertions(+), 147 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 20f370b..bdd7973 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c

@@ -2,6 +2,7 @@
#include "win32.h"
#include <conio.h>
#include <winioctl.h>
+#include <wchar.h>
#include "../strbuf.h"

unsigned int _CRT_fmode = _O_BINARY;

@@ -119,7 +120,71 @@ static int err_win_to_posix(DWORD winerr)
return error;
}

-#undef open

+static inline wchar_t *utf8_to_wchar(const char *src)

+{
+ /* Add room for some extra characters, as there are path
+ * manipulations done in e.g. opendir().
+ */

+ static wchar_t buffer[4][PATH_MAX+4];

+ static int counter = 0;
+ int result;
+
+ if (!src)

+ return NULL;
+
+ if (++counter >= ARRAY_SIZE(buffer))

+ counter = 0;
+
+ result = MultiByteToWideChar(CP_UTF8, 0, src, -1,

+ buffer[counter], PATH_MAX);

+ if (result >= 0)
+ return buffer[counter];
+ error("Could not convert utf8 '%s' to widechars: %s", src,
+ result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
+ result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
+ "flags/parameter error");
+ return NULL;
+}
+

+/* UTF-8 aware wcstombs replacement. */
+static inline size_t mingw_wcstombs(char *dst, const wchar_t *src,
size_t length)

+{
+ int result;
+
+ if (!src)

+ return -1;
+ /* Only a length check? wcstombs indicates this with !dst,
+ * WideCharToMultiByte uses !length
+ */
+ if (!dst)
+ length = 0;

+ result = WideCharToMultiByte(CP_UTF8, 0, src, -1,

+ dst, length, NULL, NULL);
+ if (result > 0) {
+ if (length == 0)
+ return result-1;
+
+ /* WideCharToMultiByte returns size with zero-terminator.
+ * wcstombs returns size without zero-terminator.
+ */
+ if (dst[result-1] == 0)
+ result--;
+ else if (result == length)
+ dst[length-1] = 0;
+ else
+ dst[result] = 0;
+
+ return result;
+ }

+ error("Could not convert widechar '%ls' to utf8: %s", src,
+ result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
+ result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
+ "flags/parameter error");

+ return -1;
+}
+#undef wcstombs
+#define wcstombs mingw_wcstombs
+

int mingw_open (const char *filename, int oflags, ...)
{

va_list args;
@@ -130,7 +195,7 @@ int mingw_open (const char *filename, int oflags, ...)

if (!strcmp(filename, "/dev/null"))
filename = "nul";
- int fd = open(filename, oflags, mode);

+ int fd = _wopen(utf8_to_wchar(filename), oflags, mode);

if (fd < 0 && (oflags & O_CREAT) && errno == EACCES) {
DWORD attrs = GetFileAttributes(filename);
if (attrs != INVALID_FILE_ATTRIBUTES && (attrs &
FILE_ATTRIBUTE_DIRECTORY))

@@ -139,6 +204,36 @@ int mingw_open (const char *filename, int oflags, ...)
return fd;
}

+FILE *mingw_fopen(const char *path, const char *mode)
+{
+ return _wfopen(utf8_to_wchar(path), utf8_to_wchar(mode));
+}
+

@@ -147,44 +242,75 @@ static inline time_t filetime_to_time_t(const
FILETIME *ft)
return (time_t)winTime;
}

+static int do_readlink(const wchar_t *path, wchar_t *buf, size_t bufsiz)
+{

+ HANDLE handle = CreateFileW(path, GENERIC_READ,
+ FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
+ NULL, OPEN_EXISTING,
+ FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OPEN_REPARSE_POINT,
+ NULL);
+
+ if (handle != INVALID_HANDLE_VALUE) {
+ unsigned char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
+ DWORD dummy = 0;
+ if (DeviceIoControl(handle, FSCTL_GET_REPARSE_POINT, NULL, 0,
buffer,
+ MAXIMUM_REPARSE_DATA_BUFFER_SIZE, &dummy, NULL)) {
+ REPARSE_DATA_BUFFER *b = (REPARSE_DATA_BUFFER *) buffer;
+ if (b->ReparseTag == IO_REPARSE_TAG_SYMLINK) {
+ int len =
b->SymbolicLinkReparseBuffer.SubstituteNameLength / sizeof(wchar_t);
+ int offset =
b->SymbolicLinkReparseBuffer.SubstituteNameOffset / sizeof(wchar_t);

+ len = (bufsiz < len) ? bufsiz : len;
+ wcsncpy(buf, &
b->SymbolicLinkReparseBuffer.PathBuffer[offset], len);
+ buf[len] = 0;
+ CloseHandle(handle);
+ return len;
+ }
+ }
+
+ CloseHandle(handle);
+ }
+
+ errno = EINVAL;
+ return -1;
+}

+
/* We keep the do_lstat code in a separate function to avoid recursion.
* When a path ends with a slash, the stat will fail with ENOENT. In
* this case, we strip the trailing slashes and stat again.
*/
-static int do_lstat(const char *file_name, struct stat *buf)
+static int do_lstat(const wchar_t *file_name, struct stat *buf)
{

- WIN32_FILE_ATTRIBUTE_DATA fdata;
+ WIN32_FIND_DATAW fdata;
+ HANDLE handle = FindFirstFileW(file_name, &fdata);

- if (!(errno = get_file_attr(file_name, &fdata))) {

- buf->st_ino = 0;
- buf->st_gid = 0;
- buf->st_uid = 0;
- buf->st_nlink = 1;
- buf->st_mode = file_attr_to_st_mode(fdata.dwFileAttributes);
- buf->st_size = fdata.nFileSizeLow |
- (((off_t)fdata.nFileSizeHigh)<<32);
- buf->st_dev = buf->st_rdev = 0; /* not used by Git */
- buf->st_atime = filetime_to_time_t(&(fdata.ftLastAccessTime));
- buf->st_mtime = filetime_to_time_t(&(fdata.ftLastWriteTime));
- buf->st_ctime = filetime_to_time_t(&(fdata.ftCreationTime));
- if (fdata.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {

- WIN32_FIND_DATAA findbuf;
- HANDLE handle = FindFirstFileA(file_name, &findbuf);

- if (handle != INVALID_HANDLE_VALUE) {

- if ((findbuf.dwFileAttributes &
FILE_ATTRIBUTE_REPARSE_POINT) &&
- (findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK)) {
- char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
- buf->st_mode = S_IREAD | S_IFLNK;
- if (!(findbuf.dwFileAttributes &
FILE_ATTRIBUTE_READONLY))
- buf->st_mode |= S_IWRITE;

- buf->st_size = readlink(file_name, buffer,
MAXIMUM_REPARSE_DATA_BUFFER_SIZE);

- }
- FindClose(handle);
- }
- }
- return 0;
+ if (handle == INVALID_HANDLE_VALUE) {
+ errno = err_win_to_posix(GetLastError());
+ return -1;
}
- return -1;
+
+ FindClose(handle);
+
+ buf->st_ino = 0;
+ buf->st_gid = 0;
+ buf->st_uid = 0;
+ buf->st_nlink = 1;
+ buf->st_mode = file_attr_to_st_mode(fdata.dwFileAttributes);
+ buf->st_size = fdata.nFileSizeLow |
+ (((off_t)fdata.nFileSizeHigh)<<32);
+ buf->st_dev = buf->st_rdev = 0; /* not used by Git */
+ buf->st_atime = filetime_to_time_t(&(fdata.ftLastAccessTime));
+ buf->st_mtime = filetime_to_time_t(&(fdata.ftLastWriteTime));
+ buf->st_ctime = filetime_to_time_t(&(fdata.ftCreationTime));
+ if ((fdata.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) &&
+ (fdata.dwReserved0 == IO_REPARSE_TAG_SYMLINK)) {
+ wchar_t buffer[MAX_PATH];
+ buf->st_mode = S_IREAD | S_IFLNK;
+ if (!(fdata.dwFileAttributes & FILE_ATTRIBUTE_READONLY))
+ buf->st_mode |= S_IWRITE;
+ if (do_readlink(file_name, buffer, MAX_PATH) != -1)
+ buf->st_size = wcstombs(NULL, buffer, 0);
+ }
+ return 0;
}

/* We provide our own lstat/fstat functions, since the provided
@@ -196,9 +322,10 @@ static int do_lstat(const char *file_name, struct
stat *buf)

int mingw_lstat(const char *file_name, struct stat *buf)
{
int namelen;
- static char alt_name[PATH_MAX];

- if (!do_lstat(file_name, buf))
+ wchar_t *file_name_w = utf8_to_wchar(file_name);
+
+ if (!do_lstat(file_name_w, buf))
return 0;

/* if file_name ended in a '/', Windows returned ENOENT;

@@ -207,17 +334,17 @@ int mingw_lstat(const char *file_name, struct stat
*buf)

if (errno != ENOENT)
return -1;

- namelen = strlen(file_name);
- if (namelen && file_name[namelen-1] != '/')
+ namelen = wcslen(file_name_w);
+
+ if (namelen && file_name_w[namelen-1] != L'/')
return -1;
- while (namelen && file_name[namelen-1] == '/')
+ while (namelen && file_name_w[namelen-1] == L'/')
--namelen;
if (!namelen || namelen >= PATH_MAX)
return -1;

- memcpy(alt_name, file_name, namelen);
- alt_name[namelen] = 0;
- return do_lstat(alt_name, buf);
+ file_name_w[namelen] = 0;
+ return do_lstat(file_name_w, buf);
}

#undef fstat

@@ -265,7 +392,7 @@ int mingw_utime (const char *file_name, const struct

utimbuf *times)
int fh, rc;

/* must have write permission */
- if ((fh = open(file_name, O_RDWR | O_BINARY)) < 0)
+ if ((fh = _wopen(utf8_to_wchar(file_name), O_RDWR | O_BINARY)) < 0)
return -1;

time_t_to_filetime(times->modtime, &mft);

@@ -287,10 +414,11 @@ unsigned int sleep (unsigned int seconds)

int mkstemp(char *template)
{
- char *filename = mktemp(template);
+ wchar_t *filename = _wmktemp(utf8_to_wchar(template));
if (filename == NULL)
return -1;
- return open(filename, O_RDWR | O_CREAT, 0600);

+ wcstombs(template, filename, strlen(template)+1);

+ return _wopen(filename, O_RDWR | O_CREAT, 0600);
}

int gettimeofday(struct timeval *tv, void *tz)

@@ -435,17 +563,22 @@ struct tm *localtime_r(const time_t *timep, struct
tm *result)
return result;
}

-#undef getcwd

char *mingw_getcwd(char *pointer, int len)

{
int i;
- char *ret = getcwd(pointer, len);

+ wchar_t buffer[PATH_MAX];
+ wchar_t *ret = _wgetcwd(buffer, PATH_MAX);

+ /* _wgetcwd sets errno correctly. */

if (!ret)
- return ret;
- for (i = 0; pointer[i]; i++)
- if (pointer[i] == '\\')
- pointer[i] = '/';
- return ret;

+ return NULL;

+ for (i = 0; buffer[i]; i++)
+ if (buffer[i] == L'\\')
+ buffer[i] = L'/';

+ if (wcstombs(pointer, buffer, len) >= len) {
+ errno = ERANGE;
+ return NULL;
+ }

+ return pointer;
}

#undef getenv

@@ -1130,29 +1263,31 @@ int mingw_rename(const char *pold, const char *pnew)

DWORD attrs, gle;
int tries = 0;
static const int delay[] = { 0, 1, 10, 20, 40 };
+ const wchar_t *poldw = utf8_to_wchar(pold);
+ const wchar_t *pneww = utf8_to_wchar(pnew);

/*
* Try native rename() first to get errno right.
* It is based on MoveFile(), which cannot overwrite existing files.
*/
- if (!rename(pold, pnew))
+ if (!_wrename(poldw, pneww))
return 0;
if (errno != EEXIST)
return -1;
repeat:
- if (MoveFileEx(pold, pnew, MOVEFILE_REPLACE_EXISTING))
+ if (MoveFileExW(poldw, pneww, MOVEFILE_REPLACE_EXISTING))
return 0;

/* TODO: translate more errors */

gle = GetLastError();
if (gle == ERROR_ACCESS_DENIED &&
- (attrs = GetFileAttributes(pnew)) != INVALID_FILE_ATTRIBUTES) {
+ (attrs = GetFileAttributesW(pneww)) != INVALID_FILE_ATTRIBUTES) {
if (attrs & FILE_ATTRIBUTE_DIRECTORY) {
errno = EISDIR;
return -1;
}

if ((attrs & FILE_ATTRIBUTE_READONLY) &&
- SetFileAttributes(pnew, attrs & ~FILE_ATTRIBUTE_READONLY)) {

- if (MoveFileEx(pold, pnew, MOVEFILE_REPLACE_EXISTING))

+ SetFileAttributesW(pneww, attrs & ~FILE_ATTRIBUTE_READONLY)) {

+ if (MoveFileExW(poldw, pneww, MOVEFILE_REPLACE_EXISTING))
return 0;

gle = GetLastError();
/* revert file attributes on failure */

@@ -1331,11 +1466,11 @@ void mingw_open_html(const char *unixpath)

int link(const char *oldpath, const char *newpath)
{
- typedef BOOL WINAPI (*T)(const char*, const char*,
LPSECURITY_ATTRIBUTES);
+ typedef BOOL WINAPI (*T)(const wchar_t*, const wchar_t*,
LPSECURITY_ATTRIBUTES);
static T create_hard_link = NULL;
if (!create_hard_link) {
create_hard_link = (T) GetProcAddress(
- GetModuleHandle("kernel32.dll"), "CreateHardLinkA");
+ GetModuleHandle("kernel32.dll"), "CreateHardLinkW");
if (!create_hard_link)
create_hard_link = (T)-1;
}

@@ -1343,7 +1478,7 @@ int link(const char *oldpath, const char *newpath)

errno = ENOSYS;
return -1;
}
- if (!create_hard_link(newpath, oldpath, NULL)) {
+ if (!create_hard_link(utf8_to_wchar(newpath),
utf8_to_wchar(oldpath), NULL)) {
errno = err_win_to_posix(GetLastError());
return -1;
}

@@ -1352,11 +1487,11 @@ int link(const char *oldpath, const char *newpath)

int symlink(const char *oldpath, const char *newpath)
{
- typedef BOOL WINAPI (*symlink_fn)(const char*, const char*, DWORD);
+ typedef BOOL WINAPI (*symlink_fn)(const wchar_t*, const wchar_t*,
DWORD);
static symlink_fn create_symbolic_link = NULL;
if (!create_symbolic_link) {
create_symbolic_link = (symlink_fn) GetProcAddress(
- GetModuleHandle("kernel32.dll"), "CreateSymbolicLinkA");
+ GetModuleHandle("kernel32.dll"), "CreateSymbolicLinkW");
if (!create_symbolic_link)
create_symbolic_link = (symlink_fn)-1;
}

@@ -1365,7 +1500,7 @@ int symlink(const char *oldpath, const char *newpath)

return -1;
}

- if (!create_symbolic_link(newpath, oldpath, 0)) {
+ if (!create_symbolic_link(utf8_to_wchar(newpath),
utf8_to_wchar(oldpath), 0)) {
errno = err_win_to_posix(GetLastError());
return -1;
}

@@ -1374,31 +1509,11 @@ int symlink(const char *oldpath, const char
*newpath)

int readlink(const char *path, char *buf, size_t bufsiz)

{
- HANDLE handle = CreateFile(path, GENERIC_READ,
- FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
- NULL, OPEN_EXISTING,
- FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OPEN_REPARSE_POINT,
- NULL);
-
- if (handle != INVALID_HANDLE_VALUE) {
- unsigned char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
- DWORD dummy = 0;

- if (DeviceIoControl(handle, FSCTL_GET_REPARSE_POINT, NULL, 0,
buffer,
- MAXIMUM_REPARSE_DATA_BUFFER_SIZE, &dummy, NULL)) {
- REPARSE_DATA_BUFFER *b = (REPARSE_DATA_BUFFER *) buffer;
- if (b->ReparseTag == IO_REPARSE_TAG_SYMLINK) {
- int len =
b->SymbolicLinkReparseBuffer.SubstituteNameLength / sizeof(wchar_t);
- int offset =
b->SymbolicLinkReparseBuffer.SubstituteNameOffset / sizeof(wchar_t);

- snprintf(buf, bufsiz, "%*ls", len, &
b->SymbolicLinkReparseBuffer.PathBuffer[offset]);
- CloseHandle(handle);
- return len;
- }
- }
-
- CloseHandle(handle);

+ wchar_t buffer[MAX_PATH];
+ int result = do_readlink(utf8_to_wchar(path), buffer, MAX_PATH);

+ if (result >= 0) {

+ return wcstombs(buf, buffer, bufsiz);

}
-
- errno = EINVAL;

return -1;
}

@@ -1417,64 +1532,80 @@ char *getpass(const char *prompt)

return strbuf_detach(&buf, NULL);
}

-#ifndef NO_MINGW_REPLACE_READDIR

+ wname[len] = 0;
+ }
+

+ dir->first = 1;
+ dir->handle = FindFirstFileW(wname, & dir->findbuf);
+ DWORD lasterr = GetLastError();

- if (!dir->dd_handle) {
- errno = EBADF; /* No set_errno for mingw */
+ if (dir->handle == INVALID_HANDLE_VALUE) {
+ errno = err_win_to_posix(lasterr);
+ free(dir);
return NULL;
}
+ return (DIR *) dir;
+}
+
+int mingw_closedir(DIR *indir)
+{
+ struct mingw_DIR *dir = (struct mingw_DIR *) indir;
+ HANDLE handle = dir->handle;
+ free(dir);
+
+ if (! FindClose(handle)) {
+ errno = err_win_to_posix(GetLastError());
+ return -1;
+ }

+ return 0;
+}
+

+ wcstombs(dir->entry.d_name, dir->findbuf.cFileName, FILENAME_MAX);

/* Set file type, based on WIN32_FIND_DATA */
- mdir->dd_dir.d_type = 0;
- if ((buf.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) &&
- (buf.dwReserved0 == IO_REPARSE_TAG_SYMLINK))
- mdir->dd_dir.d_type |= DT_LNK;
- else if (buf.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
- mdir->dd_dir.d_type |= DT_DIR;
+ dir->entry.d_type = 0;
+ if ((dir->findbuf.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) &&
+ (dir->findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK))
+ dir->entry.d_type |= DT_LNK;
+ else if (dir->findbuf.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
+ dir->entry.d_type |= DT_DIR;
else
- mdir->dd_dir.d_type |= DT_REG;
+ dir->entry.d_type |= DT_REG;

- return (struct dirent*)&dir->dd_dir;
+ return & dir->entry;
}
-#endif // !NO_MINGW_REPLACE_READDIR

diff --git a/compat/mingw.h b/compat/mingw.h
index 4f9f7f4..b16eb6c 100644

--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -96,20 +96,6 @@ static inline int fcntl(int fd, int cmd, long arg)
* simple adaptors
*/

-static inline int mingw_mkdir(const char *path, int mode)
-{
- return mkdir(path);
-}
-#define mkdir mingw_mkdir
-
-static inline int mingw_unlink(const char *pathname)
-{
- /* read-only files cannot be removed */
- chmod(pathname, 0666);
- return unlink(pathname);
-}
-#define unlink mingw_unlink
-
static inline int waitpid(pid_t pid, int *status, unsigned options)
{
if (options == 0)

@@ -141,9 +127,28 @@ int readlink(const char *path, char *buf, size_t
bufsiz);
* replacements of existing functions
*/

-int mingw_open (const char *filename, int oflags, ...);
+int mingw_open(const char *filename, int oflags, ...);
#define open mingw_open

+FILE *mingw_fopen(const char *path, const char *mode);
+#define fopen mingw_fopen
+

+int mingw_access(const char *filename, int mode);
+#undef access
+#define access mingw_access
+
+int mingw_chdir(const char *path);
+#define chdir mingw_chdir
+
+int mingw_mkdir(const char *path, int mode);
+#define mkdir mingw_mkdir
+
+int mingw_rmdir(const char *path);
+#define rmdir mingw_rmdir
+
+int mingw_unlink(const char *file_name);
+#define unlink mingw_unlink
+
char *mingw_getcwd(char *pointer, int len);
#define getcwd mingw_getcwd

@@ -249,7 +254,6 @@ int main(int argc, const char **argv) \

} \
static int mingw_main(c,v)

-#ifndef NO_MINGW_REPLACE_READDIR
/*
* A replacement of readdir, to ensure that it reads the file type at
* the same time. This avoid extra unneeded lstats in git on MinGW

@@ -274,6 +278,12 @@ struct mingw_dirent

char d_name[FILENAME_MAX]; /* File name. */
};
#define dirent mingw_dirent
-#define readdir(x) mingw_readdir(x)
+
+DIR *mingw_opendir(const char *name);
+#define opendir mingw_opendir
+
+int mingw_closedir(DIR *dir);
+#define closedir mingw_closedir
+
struct dirent *mingw_readdir(DIR *dir);
-#endif // !NO_MINGW_REPLACE_READDIR

+#define readdir mingw_readdir
--
1.6.4.msysgit.0.2.ga5c1c.dirty

(and the clean version of readdir:)

struct dirent *mingw_readdir(DIR *indir)
{

struct mingw_DIR *dir = (struct mingw_DIR*)indir;

if (dir->first)
dir->first = 0;
else {

if (! FindNextFileW(dir->handle, & dir->findbuf)) {

DWORD lasterr = GetLastError();
if (lasterr != ERROR_NO_MORE_FILES)

errno = err_win_to_posix(lasterr);
return NULL;
}
}

wcstombs(dir->entry.d_name, dir->findbuf.cFileName, FILENAME_MAX);

/* Set file type, based on WIN32_FIND_DATA */

dir->entry.d_type = 0;

if ((dir->findbuf.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) &&

(dir->findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK))
dir->entry.d_type |= DT_LNK;

else if (dir->findbuf.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)

dir->entry.d_type |= DT_DIR;
else

dir->entry.d_type |= DT_REG;

return & dir->entry;
}

Johannes Schindelin

unread,

Aug 11, 2009, 10:48:53 AM8/11/09

to Thorvald Natvig, msysGit

Hi,

On Tue, 11 Aug 2009, Thorvald Natvig wrote:

> > That is terrible. Just to make sure I understand correctly: when in a
> > UTF-8 directory, the _plain_ Windows chdir() and getcwd() (i.e. _not_
> > the compat/mingw.c versions, if any) misbehave?
> >
> > If so: how? Does getcwd() return the UTF-8 path, rather than using
> > the current codepage, or does chdir() expect something different?
>
> If you're in a directory which has unicode characters that can't be
> represented in the current codepage, the plain windows getcwd() still
> tries to represent every unicode symbol with one character, which
> becomes nonsense. And since it's nonsense, you can't chdir() to it.

Ah, now I got it!

> But as I had to override mkdir, rmdir etc anyway (since directory-names
> in the repository might have utf8), adding getcwd wasn't much work.

Good! You fixed the issue before I understood it; I like this.

> > Well, I guess it will become my itch.
>
> This only applies if you use characters outside your "local codepage".
> So your ß and Ü is representable; they do show up ;) This only becomes a
> problem when you use, e.g. japanese or chinese letters (which I what I
> tested with -- wikipedia has an abundance of examples).

That's a good idea!

> > I just remembered (sorry, I may have been indeed too tired last night)
> > that there is a function which does exactly what you want: wcstombs().
> > Its signature looks like this:
> >
> > size_t wcstombs(char *dest, const wchar_t *src, size_t n);
> >
> > It's part of the C99 standard, according to my documentation.
>
> And it's also in the Windows CRT, it seems. But it doesn't do UTF8 :(

Heh...

> I made a mingw_wcstombs and redirected with a #define.

Good!

> > > When I made "readlink" internal above, I made it specialcase the
> > > case where you don't pass the output buffer : only return the length
> > > of the link (which is what we need in lstat, we don't care what it
> > > points to).
> > >
> >
> > Okay...
> >
> And I reverted it again, because naturally, when you call lstat() and
> readlink(), you want them to return the number of utf8 characters, not the
> number of unicode characters, so you need the actual buffer so you can utf8
> convert it.
> My bad, fixed :)

Heh. You probably meant utf8 bytes, right? I confuse the things myself,
all the time. Bytes vs characters vs columns... (There are characters in
UTF-8 that use 2 columns...)

> > > Uhm. Just ignore that, I'm being silly. _wgetcwd() returns a wchar_t
> > > pointer, whereas 'getcwd' on *nix is supposed to return the length
> > > of the returned data.
> > >
> > > Oh, and I think this was the case I wrote the 'return length from utf8
> > > conversion' idea for.
> > >
> > Okay.
> >
> That's what I get for reading manpages at 2 AM.
> man 2 getcwd is the Linux kernel version, and does return a long.
> man 3 getcwd is the POSIX one, and does return a char * ;)

Hehehe... been there, done that!

> > > If you mean a runtime 'git --config' option, I haven't looked at
> > > adding config options, but if that is doable, we can simply modify
> > > the to/from unicode functions to use the GetACP() codepage instead
> > > of UTF8 if the config option "win32.noutf8" is true? Hm. Or we could
> > > use "i18n.commitEncoding" for a new and interesting purpose. Anyway,
> > > this is doable; we'd keep using wchar_t functions, we'd just change
> > > what codepage we translate to/from.
> > >
> >
> > That sounds sensible.
> >
> > For an example what I meant by "config option", please see:
> >
> > http://repo.or.cz/w/git/mingw/4msysgit.git?a=commitdiff;h=dc0019461c584bfed81485bb7dc6854d282f28b6
> >
> Ah, wonderful example :)

:-)

> Would you accept that as a 2nd commit, done on top of this one once if
> it is accepted?

Yes, that's how it is done in the hidedotdir branch, too.

> I have.. bad experience maintaining a patch series, I always end up
> invalidating the later patches, so I'd appreciate if we could do this in
> two steps ;)

Yeah.

I hope you use rebase -i, which helps me, at least, to keep my sanity with
ever-changing patch series.

> Updated patch attached. This does one further change; it uses
> FindFirstFileW for lstat() instead of GetFileAttributesW. FindFirstFile
> is actually marginally faster (measured by using 'time git status' on
> the full msysgit repo, and google indicates others have found the same
> result), and it makes the lstat() code to check for symlinks much
> cleaner.

That's a win-win.

> Most of the patch reads cleanly, except the rewrite of readdir, where I
> have hard time seeing the similarity between the diff and my code, so
> I've cut&paste'd a full copy of that function at the very end.

Thanks!

> Looking forward to more comments :)

Let's see... ;-)

> From 2241371fefd823e34bcf2f37bd140bb9ba7a0a72 Mon Sep 17 00:00:00 2001
> From: Thorvald Natvig <sli...@users.sourceforge.net>
> Date: Mon, 10 Aug 2009 23:44:10 +0200
> Subject: [PATCH 2/2] UTF8 Support

Is there a 1/2?

> ---
> compat/mingw.c | 389 +++++++++++++++++++++++++++++++++++++-------------------
> compat/mingw.h | 46 ++++---
> 2 files changed, 288 insertions(+), 147 deletions(-)

In the final version, I would love to have a little discussion of the
things you did, how, and why. Nothing real long, but for example an
explanation why we just do away with the *A function calls altogether, and
maybe briefly mentioning that we need 2 static buffers for one direction,
but none for the other direction.

> +/* UTF-8 aware wcstombs replacement. */

I'm goint go nit-picking mode for a second: for one-line comments, we
rarely add a full stop...

> +static inline size_t mingw_wcstombs(char *dst, const wchar_t *src, size_t
> length)
> +{
> + int result;
> +
> + if (!src)
> + return -1;
> + /* Only a length check? wcstombs indicates this with !dst,
> + * WideCharToMultiByte uses !length
> + */
> + if (!dst)
> + length = 0;
> + result = WideCharToMultiByte(CP_UTF8, 0, src, -1,
> + dst, length, NULL, NULL);
> + if (result > 0) {
> + if (length == 0)
> + return result-1;

Is this really > 0, not >=? Ah, yes, because you return result - 1.
Probably because it includes the trailing NUL, correct?

FWIW usually we prefer !length over length == 0...

> +
> + /* WideCharToMultiByte returns size with zero-terminator.
> + * wcstombs returns size without zero-terminator.
> + */
> + if (dst[result-1] == 0)
> + result--;
> + else if (result == length)
> + dst[length-1] = 0;
> + else
> + dst[result] = 0;
> +
> + return result;
> + }
> + error("Could not convert widechar '%ls' to utf8: %s", src,
> + result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
> + result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
> + "flags/parameter error");
> + return -1;
> +}
> +#undef wcstombs
> +#define wcstombs mingw_wcstombs
> +
> int mingw_open (const char *filename, int oflags, ...)
> {
> va_list args;
>

> [...]

>
> @@ -207,17 +334,17 @@ int mingw_lstat(const char *file_name, struct stat *buf)
> if (errno != ENOENT)
> return -1;
>
> - namelen = strlen(file_name);
> - if (namelen && file_name[namelen-1] != '/')
> + namelen = wcslen(file_name_w);
> +
> + if (namelen && file_name_w[namelen-1] != L'/')
> return -1;

Probably this is safe, although I suspect that there might be the
off-chance that lstat() is called with a path containing backslashes,
right?

> @@ -1417,64 +1532,80 @@ char *getpass(const char *prompt)
> return strbuf_detach(&buf, NULL);
> }
>
> -#ifndef NO_MINGW_REPLACE_READDIR

Funnily enough, there is no corresponding comment in the Makefile...

> -struct dirent *mingw_readdir(DIR *dir)
> +DIR *mingw_opendir(const char *name)

If you rename the parameter back to "dir", the --patience option should
yield a better diff...

This patch looks pretty good! Can you address the few remaining comments,
and push to 'mob'?

Thanks,
Dscho

Thorvald Natvig

unread,

Aug 11, 2009, 11:23:53 AM8/11/09

to Johannes Schindelin, msysGit

>>>
>>>
>> And I reverted it again, because naturally, when you call lstat() and
>> readlink(), you want them to return the number of utf8 characters, not the
>> number of unicode characters, so you need the actual buffer so you can utf8
>> convert it.
>> My bad, fixed :)
>>
>
> Heh. You probably meant utf8 bytes, right? I confuse the things myself,
> all the time. Bytes vs characters vs columns... (There are characters in
> UTF-8 that use 2 columns...)
>

Er. Yes, I do mean bytes. This really is confusing :)

>> From 2241371fefd823e34bcf2f37bd140bb9ba7a0a72 Mon Sep 17 00:00:00 2001
>> From: Thorvald Natvig <sli...@users.sourceforge.net>
>> Date: Mon, 10 Aug 2009 23:44:10 +0200
>> Subject: [PATCH 2/2] UTF8 Support
>>
>
> Is there a 1/2?
>

That would be the symlink patch. I simply did a 'git format-patch
origin/devel' ;)

> In the final version, I would love to have a little discussion of the
> things you did, how, and why. Nothing real long, but for example an
> explanation why we just do away with the *A function calls altogether, and
> maybe briefly mentioning that we need 2 static buffers for one direction,
> but none for the other direction.
>

I see there's an -F option to git commit, I guess I'll give it a try ;)

>> +/* UTF-8 aware wcstombs replacement. */
>>
>
> I'm goint go nit-picking mode for a second: for one-line comments, we
> rarely add a full stop...
>

Ok

>> +static inline size_t mingw_wcstombs(char *dst, const wchar_t *src, size_t
>> length)
>> +{
>> + int result;
>> +
>> + if (!src)
>> + return -1;
>> + /* Only a length check? wcstombs indicates this with !dst,
>> + * WideCharToMultiByte uses !length
>> + */
>> + if (!dst)
>> + length = 0;
>> + result = WideCharToMultiByte(CP_UTF8, 0, src, -1,
>> + dst, length, NULL, NULL);
>> + if (result > 0) {
>> + if (length == 0)
>> + return result-1;
>>
>
> Is this really > 0, not >=? Ah, yes, because you return result - 1.
> Probably because it includes the trailing NUL, correct?
>
> FWIW usually we prefer !length over length == 0...
>

Yes, as the comment a bit further down explains, WideCharToMultiByte
always includes the trailing NUL, whereas all other string functions
don't. Of course, if there wasn't space for the trailing NUL, it won't
be included, which is the reason for all the odd checks.

>> @@ -207,17 +334,17 @@ int mingw_lstat(const char *file_name, struct stat *buf)
>> if (errno != ENOENT)
>> return -1;
>>
>> - namelen = strlen(file_name);
>> - if (namelen && file_name[namelen-1] != '/')
>> + namelen = wcslen(file_name_w);
>> +
>> + if (namelen && file_name_w[namelen-1] != L'/')
>> return -1;
>>
>
> Probably this is safe, although I suspect that there might be the
> off-chance that lstat() is called with a path containing backslashes,
> right?
>

getcwd() converts them to /, and git only adds '/' internally, so, in
theory, it shouldn't happen.
To be honest, I didn't find it ever trying to lstat() something that had
a trailing / either, so the check might be redundant.
Anyway, the code was already there in that fashion, the only change I've
done is to make it wchar_t :)

>> -#ifndef NO_MINGW_REPLACE_READDIR
>>
>
> Funnily enough, there is no corresponding comment in the Makefile...
>

No, the only place that define is mentioned is when it's #ifndef'd. So I
thought it was pretty safe to remove it.

>> -struct dirent *mingw_readdir(DIR *dir)
>> +DIR *mingw_opendir(const char *name)
>>
> If you rename the parameter back to "dir", the --patience option should
> yield a better diff...
>

I'm not sure I understand, you mean rename 'name' to 'dir' for
mingw_opendir()? I left it alone for now.

> This patch looks pretty good! Can you address the few remaining comments,
> and push to 'mob'?
>

Pushed as 45867beb1b85a258482431cff2d8268fea4bc7ca. I really do suck at
writing sensible commit messages, so feel free to edit that one if you
wish :)

Best Regards,
Thorvald

Johannes Schindelin

unread,

Aug 11, 2009, 12:11:40 PM8/11/09

to Thorvald Natvig, msysGit

Hi,

On Tue, 11 Aug 2009, Thorvald Natvig wrote:

> > > -struct dirent *mingw_readdir(DIR *dir)
> > > +DIR *mingw_opendir(const char *name)
> > >
> > If you rename the parameter back to "dir", the --patience option
> > should yield a better diff...
> >
> I'm not sure I understand, you mean rename 'name' to 'dir' for
> mingw_opendir()? I left it alone for now.

No, I meant the parameter "indir" of your version of mingw_readdir(). We
have a patience diff which you can switch on using --patience with all Git
programs that take diff options (such as format-patch). This will
identify uniquely identical lines (i.e. matching lines that are unique in
both versions), so if you convert the parameter name back to what it was
before, patience diff will _force_ that to be a common line.

But I see that you use the name "dir" in the function body already.

> > This patch looks pretty good! Can you address the few remaining
> > comments, and push to 'mob'?
>
> Pushed as 45867beb1b85a258482431cff2d8268fea4bc7ca. I really do suck at
> writing sensible commit messages, so feel free to edit that one if you
> wish :)

I pushed something new on top of the work/symlink branch that I will merge
into devel in the next few days unless I hear shouts of outrage.

Please check it.

Ciao,
Dscho

Thorvald Natvig

unread,

Aug 11, 2009, 2:29:31 PM8/11/09

to Johannes Schindelin, msysGit

Johannes Schindelin wrote:
>
>> I'm not sure I understand, you mean rename 'name' to 'dir' for
>> mingw_opendir()? I left it alone for now.
>>
>
> No, I meant the parameter "indir" of your version of mingw_readdir(). We
> have a patience diff which you can switch on using --patience with all Git
> programs that take diff options (such as format-patch). This will
> identify uniquely identical lines (i.e. matching lines that are unique in
> both versions), so if you convert the parameter name back to what it was
> before, patience diff will _force_ that to be a common line.
>
> But I see that you use the name "dir" in the function body already.
>

Aha, then I understand :)

>> Pushed as 45867beb1b85a258482431cff2d8268fea4bc7ca. I really do suck at
>> writing sensible commit messages, so feel free to edit that one if you
>> wish :)
>>
>
> I pushed something new on top of the work/symlink branch that I will merge
> into devel in the next few days unless I hear shouts of outrage.
>
> Please check it.
>

Looks good to me. I'll use that branch as a base for adding the config
option then; should have it ready in an hour or so.

Best Regards,
Thorvald

Thorvald Natvig

unread,

Aug 11, 2009, 3:45:09 PM8/11/09

to Johannes Schindelin, msysGit

Option patch applied.

Note that there's a tiny bootstrapping problem. If you have
core.utf8filenames = false in your git config file, that file is only
read after git has checked cwd() etc. Which means we have the cwd stored
in utf8, and we just switched the config to non-utf8. I only encountered
this if you have non-ascii characters in the cwd when you run git; if
you run git from the base directory (which hopefully doesn't have
non-ascii chars), everything works as expected. And of course everything
works fine if you use utf8 all the way (which is the default).
For people that must run git from a non-ascii base path, there's a new
#define you can use.

I first tried using the "original" wcstombs(), but they behave
differently than what the 'A' filesystem functions do (untranslatable
characters should become literal '?', not give errors).
Using WideCharToMultiByte with the local ANSI codepage, I get the exact
same result from e.g. readdir() as what you get with the 'A' functions.

Best Regards,
Thorvald

From b410c03a734dbeef8ee5a8aa5c8e0ef987ceeecb Mon Sep 17 00:00:00 2001
From: Thorvald Natvig <sli...@users.sourceforge.net>
Date: Tue, 11 Aug 2009 21:28:53 +0200
Subject: [PATCH] Add core.utf8filenames option

If "true"(default), will store file and directory names in utf8 in the
index. If "false", will use the local ANSI codepage. Note that, if false,
pathnames that can't be represented in the ANSI codepage will fail with
weird results.

If the current working directory has non-ASCII characters in it, the calls
to getcwd() done before the config is read means some operations can fail.
To work around this, there is an optional define NO_UTF8_FILENAMES which
will make the default be "false".
---
Makefile | 3 +++
cache.h | 1 +
compat/mingw.c | 18 +++++++++++++-----
config.c | 5 +++++
environment.c | 5 +++++
5 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/Makefile b/Makefile
index 690ac55..b6c3b40 100644
--- a/Makefile
+++ b/Makefile
@@ -198,6 +198,9 @@ all::
# memory allocators with the nedmalloc allocator written by Niall Douglas.
#
# Define NO_REGEX if you have no or inferior regex support in your C
library.
+#
+# Define NO_UTF8_FILENAMES if you want the default for
core.utf8filenames to be
+# false. Necesarry if the current directory has non-ascii characters.

GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE
@$(SHELL_PATH) ./GIT-VERSION-GEN
diff --git a/cache.h b/cache.h
index e6c7f33..07c126f 100644
--- a/cache.h
+++ b/cache.h
@@ -518,6 +518,7 @@ extern size_t delta_base_cache_limit;
extern int auto_crlf;
extern int fsync_object_files;
extern int core_preload_index;
+extern int utf8_filenames;

enum safe_crlf {
SAFE_CRLF_FALSE = 0,
diff --git a/compat/mingw.c b/compat/mingw.c
index abfb40e..129dcb9 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -129,6 +129,7 @@ static inline wchar_t *utf8_to_wchar(const char *src)

static wchar_t buffer[4][PATH_MAX+4];

static int counter = 0;

int result;
+ extern int utf8_filenames;

if (!src)
return NULL;
@@ -136,8 +137,8 @@ static inline wchar_t *utf8_to_wchar(const char *src)
if (++counter >= ARRAY_SIZE(buffer))
counter = 0;

- result = MultiByteToWideChar(CP_UTF8, 0, src, -1,
- buffer[counter], PATH_MAX);
+ result = MultiByteToWideChar(utf8_filenames ? CP_UTF8 : CP_ACP,
+ 0, src, -1, buffer[counter], PATH_MAX);
if (result >= 0)
return buffer[counter];

error("Could not convert utf8 '%s' to widechars: %s", src,

@@ -147,9 +148,15 @@ static inline wchar_t *utf8_to_wchar(const char *src)
return NULL;
}

-/* UTF-8 aware wcstombs replacement */
+/*
+ * UTF-8 aware wcstombs replacement
+ * For non-utf8 locales, this does the same as the windows filesystem;
+ * replace unknown unicode characters with '?' instead of failing to
+ * convert the string.
+ */

static inline size_t mingw_wcstombs(char *dst, const wchar_t *src,
size_t length)
{

+ extern int utf8_filenames;
int result;

if (!src)
@@ -160,8 +167,8 @@ static inline size_t mingw_wcstombs(char *dst, const
wchar_t *src, size_t length
*/
if (!dst)
length = 0;
- result = WideCharToMultiByte(CP_UTF8, 0, src, -1,
- dst, length, NULL, NULL);
+ result = WideCharToMultiByte(utf8_filenames ? CP_UTF8 : CP_ACP,
+ 0, src, -1, dst, length, NULL, NULL);
if (result > 0) {
if (! length)
return result-1;
@@ -1573,6 +1580,7 @@ DIR *mingw_opendir(const char *name)

free(dir);
return NULL;
}
+
return (DIR *) dir;
}

diff --git a/config.c b/config.c
index 738b244..6be7482 100644
--- a/config.c
+++ b/config.c
@@ -505,6 +505,11 @@ static int git_default_core_config(const char *var,
const char *value)
return 0;
}

+ if (!strcmp(var, "core.utf8filenames")) {
+ utf8_filenames = git_config_bool(var, value);

+ return 0;
+ }
+

/* Add other config variables here and to Documentation/config.txt. */
return 0;
}
diff --git a/environment.c b/environment.c
index 8f5eaa7..e13fa05 100644
--- a/environment.c
+++ b/environment.c
@@ -48,6 +48,11 @@ enum push_default_type push_default =
PUSH_DEFAULT_MATCHING;
#endif
enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
int grafts_replace_parents = 1;
+#ifdef NO_UTF8_FILENAMES
+int utf8_filenames = 0;
+#else
+int utf8_filenames = 1;
+#endif

/* Parallel index stat data preload? */
int core_preload_index = 0;
--
1.6.4.msysgit.0.3.g7fe3.dirty

Johannes Sixt

unread,

Aug 11, 2009, 5:27:25 PM8/11/09

to Thorvald Natvig, msy...@googlegroups.com, Johannes Schindelin

On Dienstag, 11. August 2009, Thorvald Natvig wrote:
> +/* UTF-8 aware wcstombs replacement. */
> +static inline size_t mingw_wcstombs(char *dst, const wchar_t *src,
> size_t length)

No, this is not a replacement of wcstombs because it does not obey the locale.
Please give it a different name.

> +{
> + int result;
> +
> + if (!src)
> + return -1;
> + /* Only a length check? wcstombs indicates this with !dst,
> + * WideCharToMultiByte uses !length
> + */
> + if (!dst)
> + length = 0;
> + result = WideCharToMultiByte(CP_UTF8, 0, src, -1,
> + dst, length, NULL, NULL);
> + if (result > 0) {
> + if (length == 0)
> + return result-1;
> +
> + /* WideCharToMultiByte returns size with zero-terminator.
> + * wcstombs returns size without zero-terminator.
> + */
> + if (dst[result-1] == 0)
> + result--;
> + else if (result == length)
> + dst[length-1] = 0;

This is wrong, according to my man page of wcstmbs: If the buffer was too
short, then destination is not NUL terminated. But it doesn't matter because
this isn't wcstombs anyway ;-)

> + else
> + dst[result] = 0;
> +
> + return result;
> + }
> + error("Could not convert widechar '%ls' to utf8: %s", src,
> + result == ERROR_INSUFFICIENT_BUFFER ? "name too long" :
> + result == ERROR_NO_UNICODE_TRANSLATION ? "unicode lacking" :
> + "flags/parameter error");
> + return -1;
> +}

...

Repeat after me:

I SHALL NEVER USE FindFirstFile's TIMESTAMPS.

You absolutely MUST use GetFileAttributesEx's timestamps. Read

http://kobesearch.cpan.org/htdocs/Win32-UTCFileTime/Win32/UTCFileTime.html

if you want to know more.

I see a different problem with your approach:

Let's assume that I am in codepage 1252 (latin1) and I have a file
named 'süß'.

What will git do if I type this in CMD:

git add süß

I guess it won't do the expected because the byte sequence 's\xfc\xdf' (which
was recieved via argv[]) is not correct UTF-8.

-- Hannes

Thorvald Natvig

unread,

Aug 11, 2009, 8:45:31 PM8/11/09

to Johannes Sixt, msy...@googlegroups.com, Johannes Schindelin

> No, this is not a replacement of wcstombs because it does not obey the locale.
> Please give it a different name.
>

Suggestions, then?

>
> Repeat after me:
>
> I SHALL NEVER USE FindFirstFile's TIMESTAMPS.
>
> You absolutely MUST use GetFileAttributesEx's timestamps. Read
>
> http://kobesearch.cpan.org/htdocs/Win32-UTCFileTime/Win32/UTCFileTime.html
>
> if you want to know more.
>

That document seems to point out that, to get uniform timestamps no
matter the timezone, on both FAT and NTFS, you need to open the file and
use GetInformationByFileHandle(). While I haven't checked, I think that
would make lstat() rather slow.
It also points out that this really is only a problem on FAT volumes.

I'm assuming the issue you are trying to make is that FindFirstFile uses
cached timestamps from the directory entry instead of from the file
entry in the MFT, and hence may not be completely up to date? If so,
that's (reputedly) been fixed for some time now, and at least works fine
here on Vista. But if it wasn't fixed until Vista (I don't have an XP
machine to check), we can't require Vista, and hence it should probably
be changed back.

I'll assume it's the latter that's the problem. If that's wrong, please
let me know.

> I see a different problem with your approach:
>
> Let's assume that I am in codepage 1252 (latin1) and I have a file
> named 'süß'.
>
> What will git do if I type this in CMD:
>
> git add süß
>
> I guess it won't do the expected because the byte sequence 's\xfc\xdf' (which
> was recieved via argv[]) is not correct UTF-8.
>

Correct, and I also commented on this in one of the mails. To get this
right, we'll have to modify the main() in mingw.h to call
GetCommandLineW, CommandLineToArgvW, and then convert each argument to
utf8. Or we could try using wmain() (I haven't done so before, but it
looks sane enough on MSDN at least). However, that will still only work
for files in your local codepage, as cmd.exe (and msys' bash) can't pass
arguments outside the local codepage. So you can never add files with
e.g. chinese charaters. To do that, you need to 'git add <containing
directory>'.
Making this change would mean applications that simply exec git would
work with any filename, as long as they passed wchar_t arguments to
CreateProcess, so it's probably a good change.
I can make a patch to do this after we've agreed on this one :)

Best Regards,
Thorvald

Peter Harris

unread,

Aug 11, 2009, 11:36:00 PM8/11/09

to Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com, Johannes Schindelin

On Tue, Aug 11, 2009 at 8:45 PM, Thorvald Natvig wrote:
>
>> I see a different problem with your approach:
>> Let's assume that I am in codepage 1252 (latin1) and I have a file named
>> 'süß'.
>>
>> What will git do if I type this in CMD:
>>
>> git add süß
>>
>> I guess it won't do the expected because the byte sequence 's\xfc\xdf'
>> (which was recieved via argv[]) is not correct UTF-8.
>>
>
> Correct, and I also commented on this in one of the mails. To get this
> right, we'll have to modify the main() in mingw.h to call GetCommandLineW,
> CommandLineToArgvW, and then convert each argument to utf8. Or we could try
> using wmain() (I haven't done so before, but it looks sane enough on MSDN at
> least). However, that will still only work for files in your local codepage,
> as cmd.exe (and msys' bash) can't pass arguments outside the local codepage.

I haven't tested bash, but cmd.exe passes UTF-16 characters just fine.
Even when you're not using a Unicode font.

I tested this when working on
http://lists.xiph.org/pipermail/commits/2008-September/014115.html --
using tab completion, Chinese glyphs show up as ???? on the (en_US)
command line, but the correct UTF-16 is retrieved by GetCommandLineW.

Peter Harris

Marius Storm-Olsen

unread,

Aug 12, 2009, 3:35:25 AM8/12/09

to Thorvald Natvig, Johannes Schindelin, msysGit

> + result = MultiByteToWideChar(CP_UTF8, 0, src, -1,
> + buffer[counter], PATH_MAX);

I see you're using Windows own conversions for UTF-8, are you sure
this works properly? I originally thought we had to take advantage of
the iconv lib, due to the following issue on Windows:

http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=100056

Though, I guess it's only a locale issue, and Win32 NLS API & MBTWC
does the proper UTF conversions?

--
.marius

Thorvald Natvig

unread,

Aug 12, 2009, 8:57:39 AM8/12/09

to Marius Storm-Olsen, Johannes Schindelin, msysGit

MultiByteToWideChar doesn't use locale, unless you specify 'CP_ACP'
(where it uses the program ANSI page, correct for the filesystem) or
'CP_OEMCP' (the OEM codepage, which gives correct output in the
console). For example, if you take the string L"pølse", the CP_ACP will
make it "p°lse" if you put it through printf(), but that's actually the
correct filename for non-wchar_t file functions. If you convert with
CP_OEMCP, you'll get "pølse" if you print it on the screen, but that
file can't be found.

Anyway, the goal was UTF8, and with the UTF-8 specific page, which can
only be handled by MultiByteToWideChar (and WideCharToMultiByte), it did
successfully convert the chinese testcharacters I used back and forth,
including doing it correctly when cloning from a unix repository.

Thorvald Natvig

unread,

Aug 12, 2009, 8:59:25 AM8/12/09

to Peter Harris, Johannes Sixt, msy...@googlegroups.com, Johannes Schindelin

Peter Harris wrote:
>
> I haven't tested bash, but cmd.exe passes UTF-16 characters just fine.
> Even when you're not using a Unicode font.
>
> I tested this when working on
> http://lists.xiph.org/pipermail/commits/2008-September/014115.html --
> using tab completion, Chinese glyphs show up as ???? on the (en_US)
> command line, but the correct UTF-16 is retrieved by GetCommandLineW.
>

Excellent news :)
This should be a trivial fix then, once the rest of the patches are
accepted.

Thorvald Natvig

unread,

Aug 12, 2009, 9:26:52 AM8/12/09

to Marius Storm-Olsen, Johannes Schindelin, msysGit

> MultiByteToWideChar doesn't use locale, unless you specify 'CP_ACP'
> (where it uses the program ANSI page, correct for the filesystem) or
> 'CP_OEMCP' (the OEM codepage, which gives correct output in the
> console). For example, if you take the string L"pølse", the CP_ACP
> will make it "p°lse" if you put it through printf(), but that's
> actually the correct filename for non-wchar_t file functions. If you
> convert with CP_OEMCP, you'll get "pølse" if you print it on the
> screen, but that file can't be found.

Ehm, that specific example is for WideCharToMultiByte, not the other way
around. Sorry about that :)
Anyway, if you want "same filenames as in 'A' functions", you need to
use the ACP, and the OEMCP for printing to screen. cmd also runs with a
different codepage than most GUI applications (at least it does so
here), so the byte<=>glyph mapping differs.
So things are actually much more sane when we map everything to utf8 in
the index, and use the wchar_t equivalent for the file operations.

Johannes Sixt

unread,

Aug 12, 2009, 3:49:00 PM8/12/09

to Thorvald Natvig, msy...@googlegroups.com, Johannes Schindelin

Thorvald Natvig schrieb:

>
>> No, this is not a replacement of wcstombs because it does not obey the
>> locale. Please give it a different name.
>>
> Suggestions, then?

I already did: give it a different name; for example, wcstoutf8.

>> Repeat after me:
>>
>> I SHALL NEVER USE FindFirstFile's TIMESTAMPS.
>>
>> You absolutely MUST use GetFileAttributesEx's timestamps. Read
>>
>> http://kobesearch.cpan.org/htdocs/Win32-UTCFileTime/Win32/UTCFileTime.html
>>
>>
>> if you want to know more.
>>
> That document seems to point out that, to get uniform timestamps no
> matter the timezone, on both FAT and NTFS, you need to open the file and
> use GetInformationByFileHandle(). While I haven't checked, I think that
> would make lstat() rather slow.
> It also points out that this really is only a problem on FAT volumes.

This is perhaps not so important.

> I'm assuming the issue you are trying to make is that FindFirstFile uses
> cached timestamps from the directory entry instead of from the file
> entry in the MFT, and hence may not be completely up to date?

But this is the important issue.

> If so,
> that's (reputedly) been fixed for some time now, and at least works fine
> here on Vista. But if it wasn't fixed until Vista (I don't have an XP
> machine to check), we can't require Vista, and hence it should probably
> be changed back.

I have no clue whether this is fixed on XP.

>> What will git do if I type this in CMD:
>>
>> git add süß
>>
>> I guess it won't do the expected because the byte sequence 's\xfc\xdf'
>> (which was recieved via argv[]) is not correct UTF-8.
>>
> Correct, and I also commented on this in one of the mails. To get this
> right, we'll have to modify the main() in mingw.h to call
> GetCommandLineW, CommandLineToArgvW, and then convert each argument to
> utf8. Or we could try using wmain() (I haven't done so before, but it
> looks sane enough on MSDN at least).

This sounds more and more like writing a new libc with startup code and all...

And then, what do you do with all the script code? You can't pass
arbitrary UTF8 file names through scripts, or does MSYS use UTF8
internally? I think not.

All this is definitely outside my realm.

-- Hannes

Thorvald Natvig

unread,

Aug 12, 2009, 5:57:12 PM8/12/09

to Johannes Sixt, msy...@googlegroups.com, Johannes Schindelin

Johannes Sixt wrote:
>>>
>>> git add süß
>>>
>>> I guess it won't do the expected because the byte sequence
>>> 's\xfc\xdf' (which was recieved via argv[]) is not correct UTF-8.
>>>
>> Correct, and I also commented on this in one of the mails. To get
>> this right, we'll have to modify the main() in mingw.h to call
>> GetCommandLineW, CommandLineToArgvW, and then convert each argument
>> to utf8. Or we could try using wmain() (I haven't done so before, but
>> it looks sane enough on MSDN at least).
>
> This sounds more and more like writing a new libc with startup code
> and all...
>
> And then, what do you do with all the script code? You can't pass
> arbitrary UTF8 file names through scripts, or does MSYS use UTF8
> internally? I think not.
>
> All this is definitely outside my realm.

No, script code will still be confused. The only way to get all of it
right is, indeed, to create a new libc which uses UTF8 for filenames all
over. However, mingw uses file functions from msvcrt, which I don't
think we have the source for, and hence we can't easily modify it. So
the solution would be to create a new library, 'libmingwutf8' or
something like that, which had new file functions which chained to the
_w version in MSVCRT. If linked in before MSVCRT, file functions would
bind there. However, I think doing so is a major project (there are a
LOT more file functions that just what git uses), and probably outside
the scope of msysgit.

My goal is primarily to have something which allows you to clone a
project containing utf8 file and directory names, check out a branch,
modify the files and commit the changes, ie; basic stuff. And even if
the scripts etc don't work, the patch, as presented, is IMHO better than
checking files out with the completely wrong filename.

Best Regards,
Thorvald

Johannes Schindelin

unread,

Aug 17, 2009, 7:55:09 PM8/17/09

to Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

Hi,

On Wed, 12 Aug 2009, Thorvald Natvig wrote:

> [...] The only way to get all of it right is, indeed, to create a new

> libc which uses UTF8 for filenames all over. However, mingw uses file
> functions from msvcrt, which I don't think we have the source for, and
> hence we can't easily modify it. So the solution would be to create a
> new library, 'libmingwutf8' or something like that, which had new file
> functions which chained to the _w version in MSVCRT. If linked in before
> MSVCRT, file functions would bind there. However, I think doing so is a
> major project (there are a LOT more file functions that just what git
> uses), and probably outside the scope of msysgit.
>
> My goal is primarily to have something which allows you to clone a
> project containing utf8 file and directory names, check out a branch,
> modify the files and commit the changes, ie; basic stuff. And even if
> the scripts etc don't work, the patch, as presented, is IMHO better than
> checking files out with the completely wrong filename.

Well, I just merged your symlink work. This goes also to show you how
much I value your work.

Now let's think about things for a little bit, and decide what needs to be
done with the UTF-8 stuff.

The outcome could very well be to go all MinGW. Which is a major effort,
but Bosko and Erik are already on it, so it could not hurt if we helped
them, right?

Then we could just deprecate non-ASCII file names with MSys (read: Git
Bash and the shell scripts, since there is nothing else left that relies
on MSys), and be happy!

Ciao,
Dscho

Erik Faye-Lund

unread,

Aug 17, 2009, 8:18:25 PM8/17/09

to Johannes Schindelin, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

On Tue, Aug 18, 2009 at 1:55 AM, Johannes
Schindelin<Johannes....@gmx.de> wrote:
> Now let's think about things for a little bit, and decide what needs to be
> done with the UTF-8 stuff.
>
> The outcome could very well be to go all MinGW. Which is a major effort,
> but Bosko and Erik are already on it, so it could not hurt if we helped
> them, right?

Wait, I'm working on what? :)

Are you thinking about the stab I had at Issue 286 (Case Sensitity of
Directory Names)? I guess it kinda touches some of the same areas as
UTF-8.

However, the issue kind of scares me. At least from the quick research
I did, it seemed like the "proper" solution to it would be something
like Linus' suggestion to keep two strings in the index - one git-path
and one OS-canonical path. And that'd be a huge, tedious and intrusive
patch series ;)

The alternative (for issue 286, not UTF-8 support) is something like a
polished version of the patch I sent, which seems a bit nasty, and I'd
have to read a lot more code to feel confident about it. But it's a
lot less work than the change-the-way-the-entire-index-works thingie,
at least. And it might even help MacOSX-people.

But does it make UTF-8 support trickier to implement? How well does
strnicmp() work with UTF-8 strings? Some Google searching indicates
that it doesn't work too well...

--
Erik "kusma" Faye-Lund
kusm...@gmail.com
(+47) 986 59 656

Johannes Schindelin

unread,

Aug 17, 2009, 8:22:53 PM8/17/09

to Erik Faye-Lund, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

Hi,

On Tue, 18 Aug 2009, Erik Faye-Lund wrote:

> On Tue, Aug 18, 2009 at 1:55 AM, Johannes
> Schindelin<Johannes....@gmx.de> wrote:
> > Now let's think about things for a little bit, and decide what needs
> > to be done with the UTF-8 stuff.
> >
> > The outcome could very well be to go all MinGW. Which is a major
> > effort, but Bosko and Erik are already on it, so it could not hurt if
> > we helped them, right?
>
> Wait, I'm working on what? :)

Well, you helped getting a MinGW mstmp.exe. And we discussed (briefly)
the opportunities of MinGW Perl, and how to make the rest of Git
independent of MSys.

> Are you thinking about the stab I had at Issue 286 (Case Sensitity of
> Directory Names)? I guess it kinda touches some of the same areas as
> UTF-8.
>
> However, the issue kind of scares me. At least from the quick research
> I did, it seemed like the "proper" solution to it would be something
> like Linus' suggestion to keep two strings in the index - one git-path
> and one OS-canonical path. And that'd be a huge, tedious and intrusive
> patch series ;)
>
> The alternative (for issue 286, not UTF-8 support) is something like a
> polished version of the patch I sent, which seems a bit nasty, and I'd
> have to read a lot more code to feel confident about it. But it's a
> lot less work than the change-the-way-the-entire-index-works thingie,
> at least. And it might even help MacOSX-people.
>
> But does it make UTF-8 support trickier to implement? How well does
> strnicmp() work with UTF-8 strings? Some Google searching indicates
> that it doesn't work too well...

My comment was not primarily on UTF-8, but that we do not need to take
care (at least not that much) of MSys.

But I realize that the original comment was about the libc (which is used
even by MinGW).

Hmm.

Ciao,
Dscho

Thorvald Natvig

unread,

Aug 17, 2009, 8:37:10 PM8/17/09

to Johannes Schindelin, Johannes Sixt, msy...@googlegroups.com

Johannes Schindelin wrote:
>
> Well, I just merged your symlink work. This goes also to show you how
> much I value your work.
>

Yay :) Thanks.

> Now let's think about things for a little bit, and decide what needs to be
> done with the UTF-8 stuff.
>
> The outcome could very well be to go all MinGW. Which is a major effort,
> but Bosko and Erik are already on it, so it could not hurt if we helped
> them, right?
>
> Then we could just deprecate non-ASCII file names with MSys (read: Git
> Bash and the shell scripts, since there is nothing else left that relies
> on MSys), and be happy!
>

I agree that fixing it in MinGW/MSys somewhere would probably be ideal,
but I don't really know MSys well enough to have a clear grasp of where
to begin. But if an overall strategy has already been found, I'll help.

Best Regards,
Thorvald

Erik Faye-Lund

unread,

Aug 17, 2009, 8:50:04 PM8/17/09

to Johannes Schindelin, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

On Tue, Aug 18, 2009 at 2:22 AM, Johannes
Schindelin<Johannes....@gmx.de> wrote:
>> Wait, I'm working on what? :)
>
> Well, you helped getting a MinGW mstmp.exe. And we discussed (briefly)
> the opportunities of MinGW Perl, and how to make the rest of Git
> independent of MSys.

Oh, right. I completely misunderstood you :)

> My comment was not primarily on UTF-8, but that we do not need to take
> care (at least not that much) of MSys.

Yeah, I see that now :)

> But I realize that the original comment was about the libc (which is used
> even by MinGW).
>
> Hmm.

Actually, I started to think a bit about UTF-8 and case preserving
file systems, and got really scared. If I've understood it correctly,
we absolutely need to know the filename encoding to be able to do case
insensitive filename comparision. This might not be an issue on it's
own, but it seems to me like slow_same_name() is broken WRT non-ASCII
string comparison. Just calling toupper() on each byte of an UTF-8
string won't cut it for comparison. MSDN clearly documents that
toupper() only works correctly with ASCII characters - and on top of
that, each UTF-8 code-point might be up to 6 bytes big. Converting
each string to UCS-4 or something to avoid multi-byte special cases
before comparing might be a solution, but it doesn't sound very
appealing to me to do that all the time.

Luckily it's way past my bed-time, and hopefully someone will point
out some error in my thinking by the time I wake up. Ah, how nice that
would be.

Johannes Schindelin

unread,

Aug 18, 2009, 7:52:30 AM8/18/09

to Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

Hi,

On Tue, 18 Aug 2009, Thorvald Natvig wrote:

> Johannes Schindelin wrote:
> >
> > Now let's think about things for a little bit, and decide what needs
> > to be done with the UTF-8 stuff.
> >
> > The outcome could very well be to go all MinGW. Which is a major
> > effort, but Bosko and Erik are already on it, so it could not hurt if
> > we helped them, right?
> >
> > Then we could just deprecate non-ASCII file names with MSys (read: Git
> > Bash and the shell scripts, since there is nothing else left that
> > relies on MSys), and be happy!
>
> I agree that fixing it in MinGW/MSys somewhere would probably be ideal,
> but I don't really know MSys well enough to have a clear grasp of where
> to begin. But if an overall strategy has already been found, I'll help.

The main problem are the shell/Perl scripts, right?

I could imagine that

http://repo.or.cz/w/msys.git?a=blob;f=msys/rt/src/newlib/libc/syscalls/sysopen.c;h=6b3836f246e76904dbfa880ace47b39688222f52;hb=32b6996f5999117a7d8724d6883ff51a13b69841

and

http://repo.or.cz/w/msys.git?a=blob;f=msys/rt/src/newlib/libc/stdio/fopen.c;h=7d683673946ce40473f221fd4eb3bace3b63e7b3;hb=32b6996f5999117a7d8724d6883ff51a13b69841

should get you started (both links are to source code that compiles to
msys-1.0.dll, e.g. by running /src/rt/release.sh in the 'msys' branch).

Ciao,
Dscho

Thorvald Natvig

unread,

Aug 18, 2009, 8:39:46 AM8/18/09

to Johannes Schindelin, Johannes Sixt, msy...@googlegroups.com

Johannes Schindelin wrote:
>
>> I agree that fixing it in MinGW/MSys somewhere would probably be ideal,
>> but I don't really know MSys well enough to have a clear grasp of where
>> to begin. But if an overall strategy has already been found, I'll help.
>>
>
> The main problem are the shell/Perl scripts, right?
>
> I could imagine that
>
> http://repo.or.cz/w/msys.git?a=blob;f=msys/rt/src/newlib/libc/syscalls/sysopen.c;h=6b3836f246e76904dbfa880ace47b39688222f52;hb=32b6996f5999117a7d8724d6883ff51a13b69841
>
> and
>
> http://repo.or.cz/w/msys.git?a=blob;f=msys/rt/src/newlib/libc/stdio/fopen.c;h=7d683673946ce40473f221fd4eb3bace3b63e7b3;hb=32b6996f5999117a7d8724d6883ff51a13b69841
>
> should get you started (both links are to source code that compiles to
> msys-1.0.dll, e.g. by running /src/rt/release.sh in the 'msys' branch).
>

But git itself doesn't link msys-1.0.dll, does it? Meaning we'd need an
alternate solution for it. I'd really prefer to use the "same" solution
for git, bash and perl, as I'm a bit concerned we'll get bit rot if we
don't. This is also a problem for quite a few other applications that
originated on Linux; they assume the filesystem is UTF8. I saw a bug
report for PHP-on-Win32, there's probably others as well. And sooner or
later, someone will use a thirdparty application together with git and
wonder why things don't do what they expect.

There's an alternative strategy; create an injection library which
replaces kernel32.dll's CreateFileA, FindFirstFileA etc with wrappers
around the W variants with suitable UTF8 encoding/decoding. That
wouldn't need modifying of the C library, and it could be conditionally
included in git; simply LoadLibrary "makeutf8.dll" to enable the
injection. Then the standard char* C library functions would be "fixed"
(as they all call a kernel32.dll function in the end), and we would, for
all practical purposes, have a OS that stores filenames as UTF8. If you
call open, fopen, _open, _fopen; it doesn't matter, it all ends up in
CreateFileA() in the end. Our wrapper would simply call
MultiByteToWideChar, then chain to CreateFileW with all other parameters
untouched. (Similar for other functions).
This would work for third-party programs too ;)

It's a hack on par with using LD_PRELOAD though, so I'm not sure how
acceptable it would be.

Best Regards,
Thorvald

Peter Harris

unread,

Aug 18, 2009, 11:17:10 AM8/18/09

to Erik Faye-Lund, Johannes Schindelin, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

On Mon, Aug 17, 2009 at 8:50 PM, Erik Faye-Lund wrote:
>
> Actually, I started to think a bit about UTF-8 and case preserving
> file systems, and got really scared. If I've understood it correctly,
> we absolutely need to know the filename encoding to be able to do case
> insensitive filename comparision. This might not be an issue on it's
> own, but it seems to me like slow_same_name() is broken WRT non-ASCII
> string comparison. Just calling toupper() on each byte of an UTF-8
> string won't cut it for comparison.

It gets even worse on Windows:

The rules used for NTFS's toupper depend on the particular version of
the OS used to format the drive. See, for example:
http://blogs.msdn.com/michkap/archive/2007/10/24/5641619.aspx

...so even if you think you know the locale, you still won't be using
that particular partition's mapping.

> Luckily it's way past my bed-time, and hopefully someone will point
> out some error in my thinking by the time I wake up. Ah, how nice that
> would be.

Uh, oops.

Peter Harris

Johannes Schindelin

unread,

Aug 18, 2009, 11:26:48 AM8/18/09

to Peter Harris, Erik Faye-Lund, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

Hi,

On Tue, 18 Aug 2009, Peter Harris wrote:

> On Mon, Aug 17, 2009 at 8:50 PM, Erik Faye-Lund wrote:
> >
> > Actually, I started to think a bit about UTF-8 and case preserving
> > file systems, and got really scared. If I've understood it correctly,
> > we absolutely need to know the filename encoding to be able to do case
> > insensitive filename comparision. This might not be an issue on it's
> > own, but it seems to me like slow_same_name() is broken WRT non-ASCII
> > string comparison. Just calling toupper() on each byte of an UTF-8
> > string won't cut it for comparison.
>
> It gets even worse on Windows:
>
> The rules used for NTFS's toupper depend on the particular version of
> the OS used to format the drive. See, for example:
> http://blogs.msdn.com/michkap/archive/2007/10/24/5641619.aspx
>
> ...so even if you think you know the locale, you still won't be using
> that particular partition's mapping.

I think we're closing the time when everybody should agree that the
perfect is the enemy of the good.

If we can cope with 99% of all cases, that's good enough, methinks.

In any case, msysGit is _not_ meant for the end-user yet; so all those who
have problems should fix it themselves.

> > Luckily it's way past my bed-time, and hopefully someone will point
> > out some error in my thinking by the time I wake up. Ah, how nice that
> > would be.
>
> Uh, oops.

Heh.

Ciao,
Dscho

Erik Faye-Lund

unread,

Aug 18, 2009, 11:54:23 AM8/18/09

to Peter Harris, Johannes Schindelin, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

On Tue, Aug 18, 2009 at 5:17 PM, Peter Harris<peter.i...@gmail.com> wrote:
> The rules used for NTFS's toupper depend on the particular version of
> the OS used to format the drive. See, for example:
> http://blogs.msdn.com/michkap/archive/2007/10/24/5641619.aspx
>
> ...so even if you think you know the locale, you still won't be using
> that particular partition's mapping.

Ouch! Do you know if there's any win32-way of accessing those data?
Perhaps a dedicated path-comparison function?

On Tue, Aug 18, 2009 at 5:26 PM, Johannes
Schindelin<Johannes....@gmx.de> wrote:
> I think we're closing the time when everybody should agree that the
> perfect is the enemy of the good.
>
> If we can cope with 99% of all cases, that's good enough, methinks.

Yeah, I agree. So what are you proposing? Only supporting case
insensitivity for ASCII-filenames or something like that?

Johannes Schindelin

unread,

Aug 18, 2009, 12:40:06 PM8/18/09

to Erik Faye-Lund, Peter Harris, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

Hi,

On Tue, 18 Aug 2009, Erik Faye-Lund wrote:

> On Tue, Aug 18, 2009 at 5:26 PM, Johannes
> Schindelin<Johannes....@gmx.de> wrote:
> > I think we're closing the time when everybody should agree that the
> > perfect is the enemy of the good.
> >
> > If we can cope with 99% of all cases, that's good enough, methinks.
>
> Yeah, I agree. So what are you proposing? Only supporting case
> insensitivity for ASCII-filenames or something like that?

I'm proposing fixing the UTF-8 issue (without case-insensitivity) first,
and only in Git. That means that scripts will still have problems, but
you gotta start somewhere.

Ciao,
Dscho

Erik Faye-Lund

unread,

Aug 18, 2009, 12:40:43 PM8/18/09

to Peter Harris, Johannes Schindelin, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

On Tue, Aug 18, 2009 at 5:54 PM, Erik Faye-Lund<kusm...@googlemail.com> wrote:
> Ouch! Do you know if there's any win32-way of accessing those data?
> Perhaps a dedicated path-comparison function?

Alright, I've given it a quick stab. Here's the result:

int compare_filenames(const char *a, const char *b) {
WCHAR a_u16[MAX_PATH], b_u16[MAX_PATH];
WCHAR al[MAX_PATH], bl[MAX_PATH];
int alen = MultiByteToWideChar(CP_UTF8, 0, a, strlen(a), a_u16, MAX_PATH);
int blen = MultiByteToWideChar(CP_UTF8, 0, b, strlen(b), b_u16, MAX_PATH);
a_u16[alen] = '\0';
b_u16[blen] = '\0';
if (!GetLongPathNameW(a_u16, al, MAX_PATH)) return -1;
if (!GetLongPathNameW(b_u16, bl, MAX_PATH)) return 1;
return wcscmp(al, bl);
}

This is everything but pretty. First of all, the file has to exist, or
this fails. That sucks. Both strings are converted to the casing of
the existing filename, and those are compared. This is probably not
what we want to do, I'm just evaluating our options a bit here.

The more I look at this problem, the more I suspect that we're
attacking it from the wrong angle. Do we really want case insensitive
filenames on windows? Perhaps all we REALLY want is to have git add
the file with the correct casing on "git add", and treating everything
else as if they were different files?

Erik Faye-Lund

unread,

Aug 18, 2009, 12:43:26 PM8/18/09

to Johannes Schindelin, Peter Harris, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

On Tue, Aug 18, 2009 at 6:40 PM, Johannes
Schindelin<Johannes....@gmx.de> wrote:
> I'm proposing fixing the UTF-8 issue (without case-insensitivity) first,
> and only in Git. That means that scripts will still have problems, but
> you gotta start somewhere.

Makes sense. I guess that's pretty much a reformulation of what I
suggested. This would probably mean that UTF-8 would only work when
core.ignorecase is turned off, though. At least in the first round.

Peter Harris

unread,

Aug 18, 2009, 2:03:47 PM8/18/09

to Erik Faye-Lund, Johannes Schindelin, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

On Tue, Aug 18, 2009 at 12:40 PM, Erik

Faye-Lund<kusm...@googlemail.com> wrote:
> On Tue, Aug 18, 2009 at 5:54 PM, Erik Faye-Lund<kusm...@googlemail.com> wrote:
>> Ouch! Do you know if there's any win32-way of accessing those data?
>> Perhaps a dedicated path-comparison function?

I haven't found one yet, sorry.

> Alright, I've given it a quick stab. Here's the result:
>
> int compare_filenames(const char *a, const char *b) {
> WCHAR a_u16[MAX_PATH], b_u16[MAX_PATH];
> WCHAR al[MAX_PATH], bl[MAX_PATH];
> int alen = MultiByteToWideChar(CP_UTF8, 0, a, strlen(a), a_u16, MAX_PATH);
> int blen = MultiByteToWideChar(CP_UTF8, 0, b, strlen(b), b_u16, MAX_PATH);
> a_u16[alen] = '\0';
> b_u16[blen] = '\0';
> if (!GetLongPathNameW(a_u16, al, MAX_PATH)) return -1;
> if (!GetLongPathNameW(b_u16, bl, MAX_PATH)) return 1;
> return wcscmp(al, bl);
> }

http://blogs.msdn.com/michkap/archive/2005/10/17/481600.aspx suggests
that you want to CharUpperW al and bl and then memcmp instead of
wcscmp, to avoid string compare functions that may consider NFC and
NFD to be the same (when the FS doesn't).

> This is everything but pretty. First of all, the file has to exist, or
> this fails. That sucks. Both strings are converted to the casing of
> the existing filename, and those are compared. This is probably not
> what we want to do, I'm just evaluating our options a bit here.

Aye. GetLongPathNameW? Oh, I see. Testing for collision with generated
short names. I'd forgotten about those. What about wcscpy(al, a_u16)
instead of return when the file doesn't exist?

> The more I look at this problem, the more I suspect that we're
> attacking it from the wrong angle. Do we really want case insensitive
> filenames on windows? Perhaps all we REALLY want is to have git add
> the file with the correct casing on "git add", and treating everything
> else as if they were different files?

The current problem, for me anyway, is that a fast-forward across a
name-case-change throws an "untracked working file will be
overwritten" error, even though it's a tracked file that won't be
overwritten (because it will be deleted first). I'm not sure adjusting
'git add' would do anything for this case.

That said, I'm being incredibly lazy over here, just lobbing
suggestions over the wall and letting you do all the hard work.
Please, feel free to scratch your own itch. My itch isn't very itchy,
or I would have scratched it myself by now. :-)

Peter Harris

Erik Faye-Lund

unread,

Aug 18, 2009, 3:22:22 PM8/18/09

to Peter Harris, Johannes Schindelin, Thorvald Natvig, Johannes Sixt, msy...@googlegroups.com

On Tue, Aug 18, 2009 at 8:03 PM, Peter Harris<peter.i...@gmail.com> wrote:
> http://blogs.msdn.com/michkap/archive/2005/10/17/481600.aspx suggests
> that you want to CharUpperW al and bl and then memcmp instead of
> wcscmp, to avoid string compare functions that may consider NFC and
> NFD to be the same (when the FS doesn't).

The point about memcmp is a good one, but CharUpperW isn't. The idea
was to use the full name, as it appears in the file system as the
canonical format used for comparison instead of upper case, since I
don't know how each partition converts the name (...and I'm not sure I
want to figure it out). This will work AFAICT even across short names
(SOMEFI~1.TXT etc) and long ones (SomeFileWithALongName.txt). Not
that's a very important point.

> Aye. GetLongPathNameW? Oh, I see. Testing for collision with generated
> short names. I'd forgotten about those. What about wcscpy(al, a_u16)
> instead of return when the file doesn't exist?

The idea (that I didn't explain, silly me) was to use these functions
when building the diff to the index, so reporting a mismatch when the
file doesn't exist should be OK... However, I strongly suspect that
it'd end up being way too slow for practical use.

Now, only one of the filenames would have to be looked up with
GetLongPathNameW(), because the other one comes from the directory
iteration, and it already returns the long name. Hopefully it also
preserves unicode characters.

> The current problem, for me anyway, is that a fast-forward across a
> name-case-change throws an "untracked working file will be
> overwritten" error, even though it's a tracked file that won't be
> overwritten (because it will be deleted first). I'm not sure adjusting
> 'git add' would do anything for this case.

It wouldn't, no.

> That said, I'm being incredibly lazy over here, just lobbing
> suggestions over the wall and letting you do all the hard work.
> Please, feel free to scratch your own itch. My itch isn't very itchy,
> or I would have scratched it myself by now. :-)

Actually, I'm not really all that itchy either. I don't need Unicode
filenames for any of my projects. I just want to make git for Windows
better ;)

Thorvald Natvig

unread,

Aug 19, 2009, 8:12:28 AM8/19/09

to Erik Faye-Lund, Peter Harris, Johannes Schindelin, Johannes Sixt, msy...@googlegroups.com

Hi,

To sum this up; the conclusion, for now, is that I should do some final
cleanup of the UTF8 patch I posted and resubmit, and we'll call that
"good enough" for now. If somebody wants something more, that's their
itch to scratch?

Best Regards,
Thorvald

Johannes Sixt

unread,

Aug 19, 2009, 2:33:08 PM8/19/09

to Thorvald Natvig, Erik Faye-Lund, Peter Harris, Johannes Schindelin, msy...@googlegroups.com

Whatever. IMO "good enough" must imply that there is no price tag attached for
users who do not use the feature. The price I mean is performance during
*common operations*. This means, I don't care if the test suite takes 10%
more time to complete, but if it takes 10% longer to complete a "Refresh"
or "Stage hunk" in git-gui, then I cannot agree.

-- Hannes

Thorvald Natvig

unread,

Aug 20, 2009, 12:22:22 PM8/20/09

to Johannes Sixt, Erik Faye-Lund, Peter Harris, Johannes Schindelin, msy...@googlegroups.com

Hi,

I've cleaned up the patches and sent an updated version to the mailing
list. "git send-email" seems to work fine, thanks for that one :)

My testing of doing 'time git status', 'time git reset --hard' etc on
the full msysgit repository hasn't revealed any performance difference;
in fact it indicates the wchar_t version to be slightly faster. In
theory, these operations should be commands that do a lot of lstat(),
directory recursion etc, so should be a good test of speed impact of
filename conversion. However, with or without this patch, the operations
complete in less than 0.3 seconds, and vary by up to 0.05 seconds, so
the test probably is not statistically valid.

To avoid the problem of quite a lot of file functions being called
before the config file is read, I moved the "desired codepage" from the
config in the original patch to a environment variable in this one. If
unset, or invalid, it will default to UTF8.

The second patch also adds proper command line interpretation of
unicode, meaning that, in cmd.exe, you can add unicode paths (they show
up as 'git add ????\?????' in the console window, but it passes the
correct wchar_t command line to the program). bash so far doesn't
understand unicode paths at all, so it passes the literal '????' to the
program, which fails. Note that mingw doesn't implement the wmain()
environment, so we have to create our own, which makes the wrapper a bit
large.

Best Regards,
Thorvald

Reply all

Reply to author

Forward