win32 + wide filenames

Dan Sully

Dec 9, 2005, 12:31:33 PM

to perl5-...@perl.org

I'm trying to debug a user issue reading Japanese filenames on Win32
(ActiveState). The user is running in code page 1252, but has Japanese
filenames on disk. These filenames show up fine in Windows Explorer.

Digging through the perl code, in win32/win32.c - win32_opendir(),
I find this:

    /* do the FindFirstFile call */
    if (USING_WIDE()) {
        A2WHELPER(scanname, wbuffer, sizeof(wbuffer));
        fh = FindFirstFileW(PerlDir_mapW(wbuffer), &wFindData);
    }
    else {
        fh = FindFirstFileA(PerlDir_mapA(scanname), &aFindData);
    }

A Google later tells me that USING_WIDE() is deprecated. Which means that
Windows filenames will always come back as single bytes, and anything out of
the local code page turns into a question mark.
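
As a rough illustration of that lossy narrowing (a sketch only, not taken from win32.c or the Win32 API; narrow_lossy is a hypothetical helper, and the "code page" is simplified to Latin-1 rather than real code page 1252):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of perl or the Win32 API.
 * Narrows UTF-16 code units to single bytes, substituting '?' for
 * anything that does not fit. The "code page" is simplified to
 * Latin-1 (U+0000..U+00FF); real code page 1252 differs in 0x80..0x9F.
 * out must have room for n + 1 bytes.
 * Returns the number of code units replaced by '?'. */
size_t narrow_lossy(const uint16_t *wide, size_t n, char *out)
{
    size_t lost = 0;
    assert(wide != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (wide[i] <= 0xFF) {
            out[i] = (char)wide[i];
        } else {
            out[i] = '?';   /* information is gone at this point */
            lost++;
        }
    }
    out[n] = '\0';
    return lost;
}
```

Two different Japanese names of the same length narrow to the same run of question marks, so the narrowed name can neither be mapped back to the original nor used to open the file.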

I have a two part question - Why is USING_WIDE() deprecated? And is there no
other way to read multi-byte character filenames on Windows? This seems like
a severe limitation to me if it is.

Thanks.

-D
--
"A good messenger expects to get shot." --Larry Wall

Jan Dubois

Dec 9, 2005, 12:46:26 PM

to Dan Sully, perl5-...@perl.org

On Fri, 09 Dec 2005, Dan Sully wrote:
> A Google later tells me that USING_WIDE() is deprecated. Which means
> that Windows filenames will always come back as single bytes, and
> anything out of the local code page turns into a question mark.

It is actually not just deprecated, it is totally disabled in Perl 5.8
as the USING_WIDE() macro is defined as a constant 0 in win32/win32.h.

> I have a two part question - Why is USING_WIDE() deprecated?

USING_WIDE() was used in Perl 5.005 to switch between single byte
and double byte character encodings on Windows. This was before Perl
itself had Unicode support at all and was enabled globally with the -C
commandline option.

This code is incompatible with the way Perl 5.6 and later treat strings,
which are sometimes just single byte streams, and sometimes UTF8 encoded.
Unfortunately Perl doesn't pass the internal SV structures to the I/O
layer, but just the char* pointers, so there is no way of knowing if a
filename is UTF8 encoded, or just a single byte filename containing high-bit
characters.
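
To make the ambiguity concrete, here is a minimal sketch in C (not from the Perl source; looks_like_utf8 is a hypothetical name, and the check is simplified: it does not reject overlong forms or surrogates). A bare byte buffer can be tested for UTF-8 well-formedness, but passing that test is only a guess about what the caller meant:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s have the
 * lead-byte/continuation-byte structure of UTF-8, 0 otherwise.
 * Simplified: overlong encodings and surrogates are not rejected. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    assert(s != NULL);
    while (i < n) {
        unsigned char c = s[i];
        size_t extra;                        /* continuation bytes expected */
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;                       /* stray continuation or invalid lead */
        if (extra > n - i - 1) return 0;     /* sequence truncated */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}
```

The bytes 0xC3 0xA9 pass this check and would decode as "é" in UTF-8, yet the same two bytes are also a perfectly legal two-character single-byte filename. With only a char* in hand, the I/O layer cannot tell which one was intended.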

> And is there no other way to read multi-byte character filenames on Windows?

Actually, there is a workaround using the Win32API::File module:

http://groups.google.com/group/perl.unicode/msg/86ab5af239975df7

> This seems like a severe limitation to me if it is.

Yes, this is indeed unfortunate. It is not easy to fix though, but at least
there is a workaround.

Cheers,
-Jan

Nick Ing-Simmons

Dec 9, 2005, 1:24:07 PM

to ja...@activestate.com, Dan Sully, perl5-...@perl.org

Jan Dubois <ja...@ActiveState.com> writes:
>
>This code is incompatible with the way Perl 5.6 and later treat strings,
>which are sometimes just single byte streams, and sometimes UTF8 encoded.
>Unfortunately Perl doesn't pass the internal SV structures to the I/O
>layer, but just the char* pointers, so there is no way of knowing if a
>filename is UTF8 encoded, or just a single byte filename containing high-bit
>characters.

This is indeed a pain. In other places in perl that pass a char * and a
length, special values of the length get used to signal UTF8 (or, in weird
cases, that the char * is really an SV *). But IO isn't even consistent
about having a length argument.
Sadly, between the introduction of the PerlIO abstraction (5.003-ish)
and UTF-8 being workable (5.8), too many CPAN XS modules were using
PerlIO to make an API change viable.

Best I can suggest is that if (new, reinvented) USING_WIDE global is
set then ALL SVs used for filenames are upgraded to UTF8 before char *
is passed to IO. OS glue layer would then know where it stood.
This also happens to be de-facto how one does widenames on Unix/Linux.

That being true, one could perhaps ALWAYS do it, and if !USING_WIDE
downgrade in glue code (at least with perl5.8 there is always _some_
perl glue code - a layer - to put this stuff in).
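
A minimal sketch of that upgrade/downgrade pair in plain C (these are hypothetical helpers, not perl's own routines; the sketch assumes the non-UTF8 bytes are Latin-1, whereas a real Win32 glue layer would have to downgrade to the local code page instead):

```c
#include <assert.h>
#include <stddef.h>

/* Upgrade: re-encode n Latin-1 bytes as UTF-8.
 * dst must have room for 2 * n bytes. Returns bytes written. */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] < 0x80) {
            dst[out++] = src[i];
        } else {
            dst[out++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[out++] = (unsigned char)(0x80 | (src[i] & 0x3F));
        }
    }
    return out;
}

/* Downgrade: convert UTF-8 back to Latin-1. dst must have room for n bytes.
 * Returns bytes written, or (size_t)-1 if the input holds a code point
 * above U+00FF or a malformed sequence. */
size_t utf8_to_latin1(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0, i = 0;
    assert(src != NULL && dst != NULL);
    while (i < n) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;
            i++;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < n
                   && (src[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            dst[out++] = (unsigned char)(((c & 0x03) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else {
            return (size_t)-1;   /* cannot be represented in one byte */
        }
    }
    return out;
}
```

With every filename upgraded first, the glue layer always receives UTF-8 and can either hand it to a wide API or attempt the downgrade. The downgrade fails for names such as Japanese ones, and that failure is exactly the case the wide path is needed for.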

Dan Sully

Dec 9, 2005, 1:16:49 PM

to Jan Dubois, perl5-...@perl.org

* Jan Dubois shaped the electrons to say...

>Unfortunately Perl doesn't pass the internal SV structures to the I/O
>layer, but just the char* pointers, so there is no way of knowing if a
>filename is UTF8 encoded, or just a single byte filename containing high-bit
>characters.

Doesn't that same issue exist on *nix? And it's up to the whims of LC_CTYPE?

I've been able to work around that by using Encode to guess the encoding, and
then converting from there. I'd just like to get all the bytes, not truncated versions.

>> And is there no other way to read multi-byte character filenames on Windows?
>
>Actually, there is a workaround using the Win32API::File module:
>
> http://groups.google.com/group/perl.unicode/msg/86ab5af239975df7

Hrm. Doesn't look like Win32API::File provides FindFirstFileW()

>> This seems like a severe limitation to me if it is.
>
>Yes, this is indeed unfortunate. It is not easy to fix though, but at least
>there is a workaround.

What is the correct thing to do in win32/win32.c ?

-D
--
<weezyl> $6.66: The Value Meal of the Beast.

Jan Dubois

Dec 9, 2005, 1:45:15 PM

to Nick Ing-Simmons, Dan Sully, perl5-...@perl.org

On Fri, 09 Dec 2005, Nick Ing-Simmons wrote:
> Best I can suggest is that if (new, reinvented) USING_WIDE global is
> set then ALL SVs used for filenames are upgraded to UTF8 before char *
> is passed to IO. OS glue layer would then know where it stood.
> This also happens to be de-facto how one does widenames on Unix/Linux.
>
> That being true one could perhaps ALWAYS do it, and if !USING_WIDE
> downgrade in glue code (at least with perl5.8 there is always _some_
> perl glue code - a layer - to put this stuff in).

I think this is still error-prone as the CPAN XS code wouldn't know that it
has to upgrade all SVs to UTF8. In addition, upgrading and downgrading
is still fragile due to local code page issues.

I think the "proper" way to do this would be to duplicate the complete
PerlIO interface with a new one that accepts both SV* and char* arguments,
and then redirect the old API into the new one.

Cheers,
-Jan

Jan Dubois

Dec 9, 2005, 2:20:40 PM

to Dan Sully, perl5-...@perl.org

On Fri, 09 Dec 2005, Dan Sully wrote:
> * Jan Dubois shaped the electrons to say...
>

> >> And is there no other way to read multi-byte character filenames on Windows?
> >
> >Actually, there is a workaround using the Win32API::File module:
> >
> > http://groups.google.com/group/perl.unicode/msg/86ab5af239975df7
>
> Hrm. Doesn't look like Win32API::File provides FindFirstFileW()

Well, it could be added eventually. Patches welcome! :)
That discussion should probably go to libw...@perl.org instead.

Another workaround for Unicode filenames on Windows is using Win32::OLE
and the Scripting.FileSystemObject. Here is an example:

    use strict;
    use warnings;
    use Win32::OLE qw(in);

    Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);

    my $fso = Win32::OLE->new("Scripting.FileSystemObject");
    my $folder = $fso->GetFolder('c:\temp');
    foreach my $file (in $folder->Files) {
        print $file->Name, "\n";
    }

All string values will be properly turned into Perl SVs with the UTF8 bit set
as needed. So the code above will generate "Wide character in print" warnings
for Japanese filenames.

You can then access the files using the Win32API::File trick mentioned earlier,
or you can use the OpenAsTextStream() method on the File object to do your I/O
via OLE as well.

> >> This seems like a severe limitation to me if it is.
> >
> >Yes, this is indeed unfortunate. It is not easy to fix though, but at least
> >there is a workaround.
>
> What is the correct thing to do in win32/win32.c ?

There is no simple thing to do in the Win32 layer. It needs structural changes
in Perl itself, as Nick and I discussed in a separate message.

Cheers,
-Jan