Re: extended file name support in msysgit with unicode support


karste...@dcon.de

Feb 5, 2012, 8:53:42 PM
to James Gregurich, Karsten Blees, Michael Geddes, msy...@googlegroups.com

James Gregurich <ja...@markzware.com> wrote on 06.02.2012 00:57:18:

> Hi.
>
> I got your address from Michael Geddes via the git mailing list as
> the primary developer behind unicode file name support on Windows. I
> ran into a problem with msysgit. I'd like to bring it to your
> attention as part of your effort.
>
> I have a repository with Adobe Indesign SDKs in it that are very
> deeply nested. When I tried to clone this repository on Windows, I
> got an error during the initial checkout of 'master' because the
> path names exceeded MAX_PATH (260 bytes). It appears that on
> Windows, the way to bypass the MAX_PATH limitation is to use the
> unicode versions of the I/O functions and prepend "\\?\" to the pathname.
>
> example (CreateFile)… http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
>
> Has your work addressed this issue? If not, how much work would it
> be to add support for extended file paths?
>
> -James


[Adding cc:msysgit, perhaps someone there is more knowledgeable than me regarding this issue]

While the Unicode version almost exclusively uses wide-char APIs, it also limits all filename conversions to MAX_PATH (260) characters, so no, my work doesn't address the long filename issue.

IMO, to lift the MAX_PATH restriction in Git for Windows would be quite difficult (that's why I didn't even bother when implementing the Unicode stuff):
1.) git almost exclusively uses relative paths, and \\?\ can only be used with absolute paths.
2.) most git core path operations are limited to PATH_MAX characters anyway (259 on MinGW/Windows).

And then of course there is the issue that on some Windows versions, Windows Explorer will simply crash with path names > MAX_PATH.

Perhaps you could restructure your repository so that it uses shorter paths?

Sorry,
Karsten

Johannes Schindelin

Feb 6, 2012, 12:08:29 AM
to karste...@dcon.de, James Gregurich, Karsten Blees, Michael Geddes, msy...@googlegroups.com
Hi,

On Mon, 6 Feb 2012, karste...@dcon.de wrote:

> While the Unicode version almost exclusively uses wide-char APIs, it
> also limits all filename conversions to MAX_PATH (260) characters, so
> no, my work doesn't address the long filename issue.

IIRC modern (i.e. post-W2K) Windows versions had a longer path limit, no?
We already dropped support for Windows 98 (much to my chagrin, but it
became too difficult), so maybe it's time to drop W2K support (unless, of
course, an interested person comes along and puts in some real effort)?

Ciao,
Dscho

James Gregurich

Feb 5, 2012, 11:50:47 PM
to karste...@dcon.de, Karsten Blees, Michael Geddes, msy...@googlegroups.com
I've joined the group so I can participate.

Thanks. Yes, I could rearrange my SDK repository, at the cost of reduced maintainability and elegance. But I hate losing maintainability and elegance. I'd like to review your code and see what might be done.




I assume I can clone "https://github.com/msysgit/git.git" to get the code. What branch should I look at to get the latest usable work you've done?

karste...@dcon.de

Feb 6, 2012, 4:42:36 AM
to James Gregurich, Karsten Blees, Michael Geddes, msy...@googlegroups.com

James Gregurich <bayoub...@gmail.com> wrote on 06.02.2012 05:50:47:

> I've joined group so I can participate.
>
> thanks. Yes. could rearrange my sdk repository with the cost of
> reduced maintainability and elegance. But, I hate losing
> maintainability and elegance. I'd like to review your code and see
> what might be able to be done.
>
> I assume I can clone "https://github.com/msysgit/git.git" to get the
> code. What branch should I look at to get the latest usable work you've done?

The latest unicode patches are here:
https://github.com/kblees/git/tree/kb/unicode-v15

However, as I said before, the PATH_MAX limit is all throughout git core (...and the entire MinGW/MSYS-based tool chain).

You might try cygwin-git instead; PATH_MAX in cygwin seems to be 4096. I don't know if it will automatically prefix long paths with \\?\, though.

karste...@dcon.de

Feb 6, 2012, 5:23:12 AM
to Johannes Schindelin, Karsten Blees, James Gregurich, Michael Geddes, msy...@googlegroups.com
All Win32 versions (starting with Win NT 3.1, I think) support long paths in their Unicode (*W) APIs (with the special \\?\ notation), so the problem is not W2K.

Include/limits.h defines PATH_MAX = 259 in both MinGW and MSYS. This is mandatory at least for MinGW, as it links against MSVCRT (e.g. the MSVCRT fopen is based on CreateFileA, which doesn't support long paths). PATH_MAX is used all throughout git core, and I guess also in bash, perl and what not.
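
To illustrate what such a limit means in practice, here is a minimal Python sketch (not code from git, MinGW or MSYS). It assumes the 259 figure quoted above and counts length in UTF-16 code units, which is an assumption about how the wide-char conversion measures paths; the helper names are hypothetical.

```python
# Illustration only: PATH_MAX is the value quoted above for MinGW/MSYS.
PATH_MAX = 259


def utf16_units(path: str) -> int:
    """Number of UTF-16 code units needed to store path (no terminator)."""
    return len(path.encode("utf-16-le")) // 2


def fits_in_path_max(path: str) -> bool:
    """True if path is at most PATH_MAX UTF-16 code units long."""
    return utf16_units(path) <= PATH_MAX


if __name__ == "__main__":
    deep = "C:\\repo\\" + "\\".join(["some_deeply_nested_directory"] * 12)
    print(utf16_units(deep), fits_in_path_max(deep))
```

Any code that copies a path into a fixed PATH_MAX-sized buffer has an implicit check like this one, which is why raising the limit touches so many places.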

So unless you plan to ditch the entire toolchain, I don't see a feasible way to support long paths.

Cheers,

Karsten

Johannes Schindelin

Feb 6, 2012, 10:18:06 AM
to karste...@dcon.de, Karsten Blees, James Gregurich, Michael Geddes, msy...@googlegroups.com
Hi Karsten,

On Mon, 6 Feb 2012, karste...@dcon.de wrote:

> Johannes Schindelin <Johannes....@gmx.de> wrote on 06.02.2012
> 06:08:29:
>

Thanks for the explanation, I understand now!
Dscho

James Gregurich

Feb 6, 2012, 2:01:17 PM
to karste...@dcon.de, msy...@googlegroups.com, Karsten Blees, Michael Geddes
On Feb 6, 2012, at 1:42 AM, karste...@dcon.de wrote:

> The latest unicode patches are here:
> https://github.com/kblees/git/tree/kb/unicode-v15
>
> However, as I said before, the PATH_MAX limit is all throughout git core (...and the entire MinGW/MSYS-based tool chain).

Questions:

1) Did your work update the MinGW/MSYS-based tool chain to use the unicode functions for file I/O?

2) Does msysgit use standard mingw/msys or a modified version?



> You might try cygwin-git instead, PATH_MAX in cygwin seems to be 4096. I don't know if it will automatically prefix long paths with \\?\ though.

I'll check into that. thanks for the recommendation.


I find it astonishing that in the year 2012, these projects haven't been updated to be Unicode-savvy already. I take it they are trying to maintain compatibility with old versions of Windows that should have been retired 10 years ago? I suppose the other astonishing aspect of what I'm learning is MS' lame API for extended pathnames.

-James

James Gregurich

Feb 9, 2012, 4:04:12 PM
to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

After some analysis, here are my conclusions:

I think the effort to support extended paths and utf8 path names throughout the msys software stack is a worthwhile goal. The modern development landscape is international and cross-platform. It would be nice if the system just worked to give you a consistent experience across both *nix and Windows. Having 4096 utf8 bytes to work with on *nix and 260 on Windows is a big difference. It's a difference that can cause one headaches when he's trying to make complex tools work across both platforms. The symlink support that Michael was working on is also very desirable.

I've looked at the code for msys. I see no reason why using macros for functions like open and fopen to bypass those in the VC runtime wouldn't work. At this point, the work is halfway done, as msys is now patched to override other Windows functions. Once you provide the macro overrides in the msys stdio.h, you should be able to just recompile bash, perl, etc. and it would all just work. I don't see why relative paths would be a problem in terms of prepending "\\?\", because if you are calling something like open() with a relative path, it's going to be relative to the current working directory. One can always query the current working directory and calculate the absolute path inside the wrapper function.
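
The path calculation such a wrapper would need can be sketched as follows. This is a Python illustration of the string handling only, not msys or git code: the function name `to_extended_path` is hypothetical, the current directory is passed in rather than queried, and `ntpath` is used so the Windows path rules apply on any platform. It assumes that paths carrying the \\?\ prefix are passed through without normalization, so "..", "." and forward slashes have to be resolved before prefixing.

```python
import ntpath

EXTENDED_PREFIX = "\\\\?\\"       # the four characters \\?\
UNC_PREFIX = "\\\\?\\UNC\\"       # \\?\UNC\ for network shares


def to_extended_path(path: str, cwd: str) -> str:
    """Return path as an absolute, normalized, extended-length path."""
    if path.startswith(EXTENDED_PREFIX):
        return path  # already in extended form, leave untouched
    # Resolve against the working directory, then collapse "..", "."
    # and forward slashes, since the prefix disables that parsing.
    absolute = ntpath.normpath(ntpath.join(cwd, path))
    if not ntpath.isabs(absolute):
        # e.g. drive-relative input such as "D:foo"; not handled here
        raise ValueError("cannot make path absolute: " + path)
    if absolute.startswith("\\\\"):
        # \\server\share\... becomes \\?\UNC\server\share\...
        # (device paths such as \\.\ are not handled in this sketch)
        return UNC_PREFIX + absolute[2:]
    return EXTENDED_PREFIX + absolute


if __name__ == "__main__":
    print(to_extended_path("sdk/source\\..\\include\\api.h", "C:\\work\\repo"))
```

A real wrapper would do the equivalent in C on wide strings and then call the *W function with the result.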


It was my goal to try such a scheme. However, I've lost so much time just fighting with the dev environment and the irritations of Windows that I have to table my efforts. There is just too much of a learning curve for me since I'm not that familiar with Windows development. I have to move on for now. I might tinker with it in the future. If someone else wants to work on it, I might be able to contribute some dev time in support of that effort, but unfortunately, I'm not in a position where I can take the lead. If someone wants to take the lead on the effort and would like a helper, let me know.


All of that said, I think the better solution would be to stop linking apps against MS' standard lib and create an msys version of libc. This library would dynamically load the VC runtime and call the functions by means of dynamically acquired function pointers. That way, you don't have to play macro-metaprogramming games (which are problematic) and you would have full freedom to extend MS' runtime in a more POSIX-friendly way.


BTW: I did look at cygwin. I couldn't find where in their source code they patch open and fopen. But even if they do support the extended file names, they don't translate unix symlinks to Windows symlinks. They probably would if the morons at Microsoft hadn't made mklink a privileged operation. But that's the way it is. And those of us looking to integrate Windows into *nix development efforts just have to hack things and go through extra maintenance hassles.


BTW: Does anyone know what role "newlib" plays in msys and cygwin? I couldn't exactly discern that.


Arioch

Jul 19, 2012, 3:25:57 AM
to msy...@googlegroups.com
Well, let me jump onto this departing bandwagon too.

For x-reference: https://github.com/msysgit/git/issues/24


> Having 4096 utf8 bytes to work with on *nix

A UTF-8 char takes from 1 to 4 bytes.
So is it 4096 bytes == 1K UTF-8 chars?
Or are 4K chars 16 KBytes?
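
The arithmetic behind that question can be checked with a small Python sketch. It assumes the limit is a byte count of 4096, the figure used above, so the number of characters that fit depends on which characters the path contains; the helper names are made up for illustration.

```python
def utf8_len(text: str) -> int:
    """Number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def max_chars(byte_limit: int, bytes_per_char: int) -> int:
    """How many characters of a given encoded width fit in byte_limit."""
    return byte_limit // bytes_per_char


if __name__ == "__main__":
    # One sample character for each UTF-8 width, from 1 to 4 bytes.
    for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
        width = utf8_len(ch)
        print(repr(ch), width, "byte(s),", max_chars(4096, width), "chars fit")
```

Under that assumption a 4096-byte budget holds anywhere between 1024 and 4096 characters, so "4096 bytes == 1K chars" is only the worst case.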

> and 260 on Windows is a big difference.

Frankly, it just makes it impossible to participate in some projects.
There is JGit in Eclipse, but it is much harder to use and much less comfortable if
you do not use Eclipse as your main IDE.
At least, I could not anonymously download a project from GitHub over HTTP/SSL via Eclipse 4.2.

> The symlink support that Michael was working on is also very desirable.

Symlinks on Windows are rather poorly supported, especially by third-party
programs like file managers. If I moved the repo folder, would it update all
the links? Who knows.

While someone should push for the bleeding edge, I guess the main priority for
version control is reliability, so I think Git should not rely on symlinks by default.

> I've looked at the code for msys. I see no reason why using macros for
> functions like open & fopen to bypass
> those in the vc runtime woudn't work. At this point, there work is halfway
> done as mysys is now patched to
> override other Windows functions.

...well. Is the talk here about Windows functions (Win32/Win64 API) or the MS Visual C++
runtime?

Windows offers the internal and deprecated functions OpenFile and _lopen, as far as
I remember. The official Windows API for opening files is the CreateFile function
(which, like almost all of the Win API, comes in two flavours: CreateFileA and
CreateFileW for non-Unicode and UTF-16 operation respectively).
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx

As for fopen, it comes from the Visual C++ runtime, such as the Visual C++ 6.0
msvcrt.dll.
Frankly, that library is considered obsolete now, and most projects have moved to
more recent VC++ versions and MSVC runtimes such as 9.0 or 10.0.

However even MSVCRT 6.0 has some kind of Unicode support:

http://support.microsoft.com/kb/125750 has a reference on Unicode MFC apps.

http://msdn.microsoft.com/en-us/library/aa246392(v=vs.60).aspx has a comparison
of fopen flavours. I am not a C++ programmer, but from what I saw, "_T" is the standard
MS VC++ way to say "maybe Unicode, maybe not, depending on project options".

I would like to name a few Unicode-aware projects that use or used the MS VC
runtime in different versions.

7-zip.org - AFAIR has always used Unicode.
MediaInfo.sf.net - had a Unicode build.
FirebirdSQL.org - the server engine is built against both a modern MSVCRT (I don't
remember the exact version) and IBM Classes for Unicode. But the IDE is built against
msvcr71.dll without ICU and is also Unicode-aware.

Overall, I think even the v6.0 MSVCRT.DLL does offer Unicode-aware file services.
Frankly, I thought the internals of git were already Unicode-aware.

> path, its going to be relative to the current working directory. One can
> always query the current working
> directory and calculate the absolute path inside the wrapper function.

There is a Windows API for that, which accounts for network paths, ".." traversal
and such:
GetFullPathName http://msdn.microsoft.com/en-us/library/windows/desktop/aa364963(v=vs.85).aspx

I guess it should be more reliable than a self-made function, which might turn obsolete
in the future, for example because of those symlinks and such.

----------

> Windows symlinks. They probably would if the morons at Microsoft
> hadn't made mklink a privileged operation.

Such an insult :-)
As a person from the Windows side (but not the C++ side, sorry) I would say that only
morons of UNIX fame could make mount/umount a privileged operation.
And that caused all the gotchas and workarounds for it on desktop
Linux (automount, submount, daemons relying on D-Bus/HAL, and something new each
year). Frankly, I pity the Hurd; their "file translators" feature could really
rock.

But well. UNIX was not meant for desktops, and DOS was not meant for
complex environments. That led to architectural decisions and compatibility
contracts that enforced those annoying obstacles, not some evil morons.




Erik Faye-Lund

Jul 19, 2012, 3:45:37 AM
to Arioch, msy...@googlegroups.com
On Thu, Jul 19, 2012 at 9:25 AM, Arioch <the_A...@nm.ru> wrote:
>> path, its going to be relative to the current working directory. One can
>> always query the current working
>> directory and calculate the absolute path inside the wrapper function.
>
> There is Windows API for that, that accounts for network paths, ".." traversing
> and such.
> GetFullPathName http://msdn.microsoft.com/en-us/library/windows/desktop/aa364963(v=vs.85).aspx
>
> I guess it should be more reliable than self-made fn, that might trn obsolete
> in future. For example those symlinks and such.
>

...but how do you pass a relative path that is longer than MAX_PATH to
it? If you prepend "\\?\" to it as the documentation suggests, surely
you'll have to pass the absolute path also? Otherwise, how would it
know that it's a relative path?

Arioch

Jul 19, 2012, 5:29:39 AM
to Erik Faye-Lund, msy...@googlegroups.com
Well taken!

I have a few hypothetical ideas, but surely they would be better discussed in some MS community rather than the narrow git/win or even the wider msys communities :-)

James Gregurich

Jul 19, 2012, 11:32:37 AM
to kusm...@gmail.com, Arioch, msy...@googlegroups.com



On Jul 19, 2012, at 12:45 AM, Erik Faye-Lund <kusm...@gmail.com> wrote:

>> .
>>
>
> ...but how do you pass a relative path that is longer than MAX_PATH to
> it? If you prepend "\\?\" to it as the documentation suggests, surely
> you'll have to pass the absolute path also? Otherwise, how would it
> know that it's a relative path?
>
> --
>

I went through this with my Cygwin project. You have to convert the relative path to an absolute path before you make the system API call with it.

James Gregurich

Jul 19, 2012, 11:52:26 AM
to Arioch, msy...@googlegroups.com



On Jul 19, 2012, at 12:25 AM, Arioch <the_A...@nm.ru> wrote:

> .
>
> ----------
>
>> Windows symlinks. They probably would if the morons at Microsoft
>> hadn't made mklink a privileged operation.
>
> Such an insult :-)


The following is what I did on my Cygwin project:

Cygwin implements posix symlinks by creating a text file with a posix path in it. The file is flagged as a system file. This mechanism is of course not recognized by Windows. I call this the default form of the symlink.

I modified cygwin's API call so that if the calling process has appropriate rights and the target file exists, it invokes the Windows symlink API call. If these conditions are not met, it falls back to the default form. I then provided a unix command for the Cygwin environment that reads a default-form symlink and updates it to a native Windows symlink if the process has the needed privileges and the target file exists. I modified my repositories to have a checkout hook that invokes the updater command via find, scanning the entire working directory for symlinks and attempting to update them. I modified the templates in git so that any new repository on the system automatically gets the hooks.
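
The "try native, else fall back" decision described above can be sketched in Python as follows. This is not the actual Cygwin patch: the function name, the return values and the on-disk layout of the fallback file (a marker followed by the path in UTF-16) are assumptions made for illustration, and setting the system-file attribute is left out.

```python
import os

# Assumed marker for the fallback file; the real layout is defined by
# Cygwin's sources and may differ from this illustration.
MARKER = b"!<symlink>"
BOM = b"\xff\xfe"


def make_symlink(target: str, link: str, native=os.symlink) -> str:
    """Create link pointing at target; return "native" or "default"."""
    if os.path.exists(target):
        try:
            native(target, link)  # may fail without the needed privilege
            return "native"
        except OSError:
            pass  # fall through to the default form
    # Default form: a plain file holding the marker and the target path.
    # A real implementation would also set the system attribute here.
    with open(link, "wb") as handle:
        handle.write(MARKER + BOM + target.encode("utf-16-le"))
    return "default"
```

An updater command in this scheme would read each default-form file, and call the same native creation step again once the process has the privilege and the target exists.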

During use, at worst, the symlink is in the default form and doesn't work on Windows. It's never invalid from the posix environment, but it is easy to run the updater command to make it valid on Windows also.


My company has been using this system in production for a few months and it seems to be working well for us. It does not create any maintenance problems, while giving us the ability to cleanly share our repositories between OS X and Windows. There have been no reliability problems. It all just works. Symlinks are handled as well as UTF8 path names and extended length path names.

I don't know what you would have to do to implement this in msys, but I have prototyped it in Cygwin and it works well.

Erik Faye-Lund

Jul 19, 2012, 12:33:48 PM
to James Gregurich, Arioch, msy...@googlegroups.com
Well, you have the benefit of having a system where someone did all
the hard work :)

I don't think even this "default form" stuff would work at all without
major hacking in MinGW (remember, Git for Windows is a MinGW
application, not a MSYS one), because mingw-rt does no emulation of
symbolic links whatsoever. Just storing the link-target in a text-file
does not make st.st_mode have the S_IFLNK-flag.

Just patching stat etc doesn't quite cut it either; we need a way of
robustly differentiating a symlink from something that looks like one.
I believe cygwin keeps some side-storage for the file-system for this,
but I don't think MSYS does. It's not impossible to recreate it, but
then (IMO) you've reached the point where what you really want is
Cygwin. So why not just use Cygwin?

James Gregurich

Jul 19, 2012, 3:07:36 PM
to kusm...@gmail.com, Arioch, msy...@googlegroups.com

On Jul 19, 2012, at 9:33 AM, Erik Faye-Lund wrote:

>
> Well, you have the benefit of having a system where someone did all
> the hard work :)
>
> I don't think even this "default form" stuff would work at all without
> major hacking in MinGW (remember, Git for Windows is a MinGW
> application, not a MSYS one), because mingw-rt does no emulation of
> symbolic links whatsoever. Just storing the link-target in a text-file
> does not make st.st_mode have the S_IFLNK-flag.
>
> Just patching stat etc doesn't quite cut it either; we need a way of
> robustly differentiating a symlink from something that looks like one.
> I believe cygwin keeps some side-storage for the file-system for this,
> but I don't thing MSYS does. It's not impossible to recreate it, but
> then (IMO) you've reached the point where what you really want is
> Cygwin. So why not just use Cygwin?


Well, I did.

I looked at modifying msys before I looked at cygwin. The main reason I picked cygwin is because it provides its own C runtime as a layer over the MS runtime. So, there is plenty of opportunity to contrive posix behavior. Also, cygwin already implemented the basic support for posix symlinks, unicode path names, extended path names and reading reparse points (native windows symlink). I just need to tack on the logic to create the reparse points.

The practical advantage of msys-git is that git-gui runs as a native app using ActiveTcl, whereas with cygwin it runs as an X11 app. Also, cygwin is pretty heavyweight. I made an effort to get cygwin to use ActiveTcl, but I was not successful in making it work. It was more work than I wanted to invest. X11 is fine for my needs.

The way cygwin identifies a file in the NTFS file system as a default-form symlink is by having the system file flag set on it and by the presence of a fixed byte sequence that prepends the posix path. If the file is a system file and the data fits the pattern, then it is presented in the posix layer as a symlink. Because the system flag is set on the file, Windows considers it opaque so no other process has any interest in touching the contents of the file.
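
The two-part test described above (system flag plus a fixed leading byte sequence) can be sketched in Python like this. The marker bytes, the UTF-16 encoding of the stored path and the function name are assumptions for illustration; the file contents and the system-flag state are passed in so the logic can be shown without touching a real NTFS volume.

```python
from typing import Optional

# Assumed leading byte sequence; the real one is defined by Cygwin's sources.
MARKER = b"!<symlink>"
BOM = b"\xff\xfe"


def read_default_symlink(data: bytes, is_system_file: bool) -> Optional[str]:
    """Return the stored posix target, or None if this is an ordinary file."""
    if not is_system_file:
        return None  # without the system flag it is never treated as a link
    header = MARKER + BOM
    if not data.startswith(header):
        return None  # system file, but the content does not fit the pattern
    try:
        target = data[len(header):].decode("utf-16-le")
    except UnicodeDecodeError:
        return None
    # Drop any trailing NUL terminator; an empty target is not a link.
    return target.rstrip("\x00") or None
```

Requiring both conditions is what keeps an ordinary text file that happens to start with the marker from being presented as a symlink.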


The reason I'm taking the time to write in is because I saw the discussion of the subject. I know it has come up in the past. So, I thought I'd share the results of my research and experience. It took me several months to research, develop and deploy this system. If someone wants the source code to think about how it might be useful in msys, he can have it. The key point is that my cygwin work demonstrates that the approach works in practice. So, it wouldn't be a waste of time to pursue it.

-James


