Digging through the perl code, in win32/win32.c - win32_opendir(),
I find this:
    /* do the FindFirstFile call */
    if (USING_WIDE()) {
        A2WHELPER(scanname, wbuffer, sizeof(wbuffer));
        fh = FindFirstFileW(PerlDir_mapW(wbuffer), &wFindData);
    }
    else {
        fh = FindFirstFileA(PerlDir_mapA(scanname), &aFindData);
    }
A Google later tells me that USING_WIDE() is deprecated. Which means that
Windows filenames will always come back as single bytes, and anything out of
the local code page turns into a question mark.
I have a two part question - Why is USING_WIDE() deprecated? And is there no
other way to read multi-byte character filenames on Windows? This seems like
a severe limitation to me if it is.
Thanks.
-D
--
"A good messenger expects to get shot." --Larry Wall
It is actually not just deprecated, it is totally disabled in Perl 5.8
as the USING_WIDE() macro is defined as a constant 0 in win32/win32.h.
> I have a two part question - Why is USING_WIDE() deprecated?
USING_WIDE() was used in Perl 5.005 to switch between single-byte
and double-byte character encodings on Windows. This was before Perl
itself had any Unicode support at all; it was enabled globally with the -C
command-line option.
This code is incompatible with the way Perl 5.6 and later treat strings,
which are sometimes just single byte streams, and sometimes UTF8 encoded.
Unfortunately Perl doesn't pass the internal SV structures to the I/O
layer, but just the char* pointers, so there is no way of knowing if a
filename is UTF8 encoded, or just a single byte filename containing high-bit
characters.
> And is there no other way to read multi-byte character filenames on Windows?
Actually, there is a workaround using the Win32API::File module:
http://groups.google.com/group/perl.unicode/msg/86ab5af239975df7
> This seems like a severe limitation to me if it is.
Yes, this is indeed unfortunate. It is not easy to fix though, but at least
there is a workaround.
Cheers,
-Jan
This is indeed a pain. In other places in perl which pass char * and length,
special values of length get used to signal UTF8 (or, in weird cases, that the
char * is really an SV *). But IO isn't even consistent about having a length
argument.
Sadly, between the introduction of the PerlIO abstraction (5.003 ish)
and UTF-8 becoming workable (5.8), too many CPAN XS modules started
using PerlIO for an API change to be viable.
Best I can suggest is that if a (new, reinvented) USING_WIDE global is
set then ALL SVs used for filenames are upgraded to UTF8 before the char *
is passed to IO. The OS glue layer would then know where it stood.
This also happens to be de-facto how one does widenames on Unix/Linux.
That being true, one could perhaps ALWAYS do it, and if !USING_WIDE
downgrade in the glue code (at least with perl5.8 there is always _some_
perl glue code - a layer - to put this stuff in).
>Unfortunately Perl doesn't pass the internal SV structures to the I/O
>layer, but just the char* pointers, so there is no way of knowing if a
>filename is UTF8 encoded, or just a single byte filename containing high-bit
>characters.
Doesn't that same issue exist on *nix, where it's up to the whims of LC_CTYPE?
I've been able to work around that by using Encode to guess the encoding and
then converting from there. I'd just like to get all the bytes, not truncated
versions.
>> And is there no other way to read multi-byte character filenames on Windows?
>
>Actually, there is a workaround using the Win32API::File module:
>
> http://groups.google.com/group/perl.unicode/msg/86ab5af239975df7
Hrm. Doesn't look like Win32API::File provides FindFirstFileW()
>> This seems like a severe limitation to me if it is.
>
>Yes, this is indeed unfortunate. It is not easy to fix though, but at least
>there is a workaround.
What is the correct thing to do in win32/win32.c ?
-D
--
<weezyl> $6.66: The Value Meal of the Beast.
I think this is still error-prone as the CPAN XS code wouldn't know that it
has to upgrade all SVs to UTF8. In addition, upgrading and downgrading
is still fragile due to local code page issues.
I think the "proper" way to do this would be to duplicate the complete
PerlIO interface with a new one that accepts both SV* and char* arguments,
and then redirect the old API into the new one.
Cheers,
-Jan
Well, it could be added eventually. Patches welcome! :)
That discussion should probably go to libw...@perl.org instead.
Another workaround for Unicode filenames on Windows is using Win32::OLE
and the Scripting.FileSystemObject. Here is an example:
    use strict;
    use warnings;
    use Win32::OLE qw(in);

    Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);

    my $fso = Win32::OLE->new("Scripting.FileSystemObject");
    my $folder = $fso->GetFolder('c:\temp');
    foreach my $file (in $folder->Files) {
        print $file->Name, "\n";
    }
All string values will be properly turned into Perl SVs with the UTF8 bit set
as needed. So the code above will generate "Wide character in print" warnings
for Japanese filenames.
You can then access the files using the Win32API::File trick mentioned earlier,
or you can use the OpenAsTextStream() method on the File object to do your I/O
via OLE as well.
> >> This seems like a severe limitation to me if it is.
> >
> >Yes, this is indeed unfortunate. It is not easy to fix though, but at least
> >there is a workaround.
>
> What is the correct thing to do in win32/win32.c ?
There is no simple thing to do in the Win32 layer. It needs structural changes
in Perl itself, as Nick and I discussed in a separate message.
Cheers,
-Jan