Perl and unicode file names

Peter Gordon

Feb 24, 2005, 6:12:28 AM
to perl-u...@perl.org
Hi Guys.

I need some help with a project that I have. I have to copy files using
Perl to different places and the filenames may be in Hebrew, Chinese,
Korean etc.

The problem is, that filenames, when using opendir, are returned as
question marks. In the DOS box I have set the codepage to 862. So DIR
returns accented characters, but Perl still returns question marks. I
have also set "use utf8", but that didn't help either.

So the problem I have is how to proceed. Should I give up with Perl and
use Java or C? Any suggestions gratefully received.

Regards,

Peter
--
Peter Gordon
Phone: +972 544 438029
Email: pe...@pg-consultants.com
Web: www.pg-consultants.com

Peter Gordon

Feb 24, 2005, 9:44:16 AM
to Guido Flohr, perl-u...@perl.org
I am working on XP. If I leave the active code page as default, when I
do dir, I get question marks for the file name. If I change the code
page to 862, for example, I get accented Latin characters.

However, no matter what I do in Perl, I get real question marks back. I
know that because I dumped the values with ord(), and each one is 63,
the ASCII question mark.

Peter

On Thu, 2005-02-24 at 15:23 +0100, Guido Flohr wrote:
> Hi,
>
> sorry, my original reply (see below) went to the sender, not to the list.
>
> Peter Gordon wrote:
> > I am using ActiveState Perl 5.008006.
> >
> > I am trying on Hebrew filenames at the moment, but the program will need
> > to run on all languages.
>
> The language does not matter, it is the charset. Hebrew can be coded in
> Unicode/UTF-8 or iso-8859-8 or cp-whatever. You really have to find out
> which charset your file system uses.
>
> > I tried "use bytes" and still get back question marks.
>
> What is "back" and what are the "question marks"? Do you see "back" (the
> output of your script) in your terminal window/DOS box or in an output
> file? And are there really question marks or are they not displayed
> correctly?
>
> Does your script throw warnings? Do you "use warnings"?
>
> > That's all the information that I have.
>
> The information about the charset used in your input data is required. A
> simple way to find that out goes like this:
>
> #! /usr/bin/perl
>
> use strict;
> use warnings;
> use bytes;
>
> opendir DIR, "/path/to/dir" or die "opendir: $!";
> my @files = readdir DIR;
> closedir DIR;
>
> open HANDLE, ">filelist.html" or die "open filelist.html: $!";
> print HANDLE "<html><body><ul>\n";
> foreach (@files) {
>     print HANDLE "<li>$_</li>\n";
> }
> print HANDLE "</ul></body></html>\n";
> close HANDLE or die "close filelist.html: $!";
> __END__
>
> Provided that you have changed the path argument to opendir in line
> 7, this will create a "filelist.html" in the current directory. Open
> that file in a browser and then change the encoding to some Western
> European charset like iso-8859-1 or windows-1252. In Mozilla this is
> View->Character Encoding->...
>
> When you see question marks here, then they are real, i.e. something
> (readdir, the OS?) has converted the input to question marks. Otherwise
> you should see accented Western European characters instead of Hebrew.
>
> Now change the encoding to utf-8/Unicode. Question marks? Then it is
> _not_ Unicode.
>
> Change it to some Hebrew character set. You see Hebrew? Then you have
> an 8 bit Hebrew character set, probably IBM-862 or ISO-8859-8.
>
> Both utf-8 and 8 bit character sets only show question marks or empty
> boxes? Then your font probably lacks the Hebrew glyphs.
>
> You can make the test again with "use utf8" and compare the results.
>
> What is your script supposed to do? If you just want to pass data from
> here to there, you have no problem. But if you want to process it
> together with data from other languages, you have to make sure that all
> data is converted to Unicode internally.
>
> Guido
>
> My original reply below:
>
> >>>The problem is, that filenames, when using opendir, are returned as
> >>>question marks. In the DOS box I have set the codepage to 862. So DIR
> >>>returns accented characters, but Perl still returns question marks. I
> >>>have also set "use utf8", but that didn't help either.
> >>
> >>Are the filenames really in UTF-8? If not, you would need "use bytes"
> >>instead of "use utf8". If that does not help, you should give more
> >>detailed information: Which Perl version? Which character sets are
> >>actually used in the filenames?
> >>
> >>>So the problem I have is how to proceed. Should I give up with Perl and
> >>>use Java or C? Any suggestions gratefully received.
> >>
> >>Do you want to blackmail us? ;-)
> >>
> >>Regards,
> >>Guido

Guido Flohr

Feb 24, 2005, 9:23:01 AM
to perl-u...@perl.org
Hi,

#! /usr/bin/perl

Guido

My original reply below:

>>>The problem is, that filenames, when using opendir, are returned as
>>>question marks. In the DOS box I have set the codepage to 862. So DIR
>>>returns accented characters, but Perl still returns question marks. I
>>>have also set "use utf8", but that didn't help either.
>>
>>Are the filenames really in UTF-8? If not, you would need "use bytes"
>>instead of "use utf8". If that does not help, you should give more
>>detailed information: Which Perl version? Which character sets are
>>actually used in the filenames?
>>
>>>So the problem I have is how to proceed. Should I give up with Perl and
>>>use Java or C? Any suggestions gratefully received.
>>
>>Do you want to blackmail us? ;-)
>>
>>Regards,
>>Guido


--
Imperia AG, Development
Leyboldstr. 10 - D-50354 Hürth - http://www.imperia.net/

Ed Batutis

Feb 24, 2005, 10:39:49 AM
to Peter Gordon, Guido Flohr, perl-u...@perl.org
> >>>So the problem I have is how to proceed. Should I give up with Perl and
> >>>use Java or C? Any suggestions gratefully received.
> >>

I started a really 'fun' flame war on this topic several months ago, so I
hesitate to say anything more. But, yes, you should give up on Perl - or run
your script on Linux with a utf-8 locale. On Win32, Perl internals are
converting the filename characters to the system default code page. So, you
are SOL for what you are trying to do.
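
A minimal sketch of why the result is ASCII 63, assuming the core Encode module and a made-up four-letter Hebrew name (neither the name nor the hex_bytes helper comes from this thread). It does not reproduce what Perl's Win32 internals do; it only shows what a lossy conversion to a code page without Hebrew looks like. cp1252 stands in for a typical Western system default code page, which is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical file name: the Hebrew letters shin, lamed, vav, final mem.
my $name = "\x{05E9}\x{05DC}\x{05D5}\x{05DD}";

# Illustrative helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Encodings that contain Hebrew keep the name intact, each with different bytes.
printf "%-11s %s\n", $_, hex_bytes( encode( $_, $name ) )
    for qw(UTF-8 iso-8859-8 cp862);

# A code page without Hebrew cannot represent the letters. With its default
# settings, Encode substitutes "?" for each character it cannot map.
my $lossy = encode( 'cp1252', $name );
printf "%-11s %s\n", 'cp1252', hex_bytes($lossy);
print "ord of first character: ", ord($lossy), "\n";
```

If the last line prints 63, the question marks are real data rather than a display problem, which matches what ord() reported earlier in the thread.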

=Ed Batutis
e...@batutis.com

Jan Dubois

Feb 24, 2005, 11:15:16 AM
to Ed Batutis, Peter Gordon, Guido Flohr, perl-u...@perl.org

Actually, you *can* work around the problems on Windows by using the
Win32API::File and Encode modules. Here is a sample program
Gisle came up with:

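#!perl -w

use strict;
use Fcntl qw(O_RDONLY);

use Win32API::File qw(CreateFileW OsFHandleOpenFd :FILE_ OPEN_EXISTING);
use Encode qw(encode);

binmode(STDOUT, ":utf8");

# The wide-character API expects a NUL-terminated UTF-16LE file name.
my $h = CreateFileW(encode("UTF-16LE", "\x{2030}.txt\0"), FILE_READ_DATA,
                    0, [], OPEN_EXISTING, 0, [])
    or die "CreateFileW failed: $^E";

# Wrap the native handle in a C file descriptor, then in a Perl filehandle.
my $fd = OsFHandleOpenFd($h, O_RDONLY);
die "OsFHandleOpenFd failed" if $fd < 0;
open(my $fh, "<&=$fd") or die "open: $!";

# This sample assumes the file's contents are UTF-16LE as well.
binmode($fh, ":encoding(UTF-16LE)");
while (<$fh>) {
    print $_;
}
close($fh) || die "close: $!";
__END__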

It may be possible to do similar readdir() emulation as well.

Win32API::File is part of libwin32 and already included in ActivePerl.
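
The file name argument is the non-obvious part of the sample, so here is a small sketch that only inspects the bytes produced by that encode() call. It uses nothing beyond the core Encode module and runs on any platform; the hex dump is purely illustrative and not part of Gisle's program.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same file name as in the sample: U+2030 (PER MILLE SIGN) followed by ".txt",
# plus the terminating NUL that the wide-character API expects.
my $wide = encode( "UTF-16LE", "\x{2030}.txt\0" );

# Every character here becomes two bytes, low byte first. No byte-order mark
# is added because the byte order is named explicitly in the encoding.
print join( ' ', map { sprintf '%02X', ord($_) } split //, $wide ), "\n";
print length($wide), " bytes\n";
```

The trailing "\0" matters: after encoding it becomes the two zero bytes that terminate the wide string, since Perl strings carry no terminator of their own.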

Cheers,
-Jan


Christian Hansen

Feb 24, 2005, 8:33:09 AM
to perl-u...@perl.org
Peter Gordon wrote:
> Hi Guys.
>
> I need some help with a project that I have. I have to copy files using
> Perl to different places and the filenames may be in Hebrew, Chinese,
> Korean etc.
>
> The problem is, that filenames, when using opendir, are returned as
> question marks. In the DOS box I have set the codepage to 862. So DIR
> returns accented characters, but Perl still returns question marks. I
> have also set "use utf8", but that didn't help either.
>
> So the problem I have is how to proceed. Should I give up with Perl and
> use Java or C? Any suggestions gratefully received.

I don't think you have to give up using Perl.

Something like this should work:

#!/usr/bin/perl

use strict;
use warnings;

use Encode;
use IO::Dir;

# Let perl know that we want to output cp862 on STDOUT
binmode( STDOUT, ':encoding(cp862)' );

my $dir = IO::Dir->new('.')
    or die("Failed to open dir : $!");

# Test with defined() so that a file named "0" does not end the loop early
while ( defined( my $entry = $dir->read ) ) {

    next if $entry =~ /^\.{1,2}$/;

    # Decode octets into perl's internal unicode encoding
    # (assumes readdir returned UTF-8; dies on malformed input)
    $entry = Encode::decode_utf8($entry, 1);

    printf( "%s\n", $entry );
}

$dir->close;


Regards
Christian

Christian Hansen

Feb 24, 2005, 10:34:22 AM
to Peter Gordon, perl-u...@perl.org
Peter Gordon wrote:
> Hi Guys.
>
> I need some help with a project that I have. I have to copy files using
> Perl to different places and the filenames may be in Hebrew, Chinese,
> Korean etc.
>
> The problem is, that filenames, when using opendir, are returned as
> question marks. In the DOS box I have set the codepage to 862. So DIR
> returns accented characters, but Perl still returns question marks. I
> have also set "use utf8", but that didn't help either.
>
> So the problem I have is how to proceed. Should I give up with Perl and
> use Java or C? Any suggestions gratefully received.

I don't think you have to give up using Perl.
