Opening Unicode files?

Ilya Zakharevich

Dec 24, 2011, 8:52:10 PM

Does Perl ship with a simple method of opening Unicode files? E.g., I
would like to have something like

open my $fh, '< :BOM0or(utf8)', $filename

where BOM0or does what Perl itself does for Perl files: it looks at the
first 4 bytes; given that a Perl file starts in ASCII, one can detect
BOMs, can detect UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or see that it
is none of the above (then the argument in parens explains what to do;
e.g., Perl itself does BOM0or(latin1)).

Likewise, if one does not know that the file starts in ASCII, one can
still detect BOM (which does not appear often in the encodings I know)
so one could do :BOMor(utf8). Do not recollect seeing such support
for files open()ed by Perl programs; is there?
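
For reference, the BOM-only half of this request (the wished-for :BOMor(utf8)) can be sketched with core functions alone. This is only a sketch under stated assumptions: open_bom_or is an invented name, not something Perl ships, and a file starting with FF FE 00 00 is treated as UTF-32LE even though it could also be UTF-16LE text whose first character is NUL.

```perl
use strict;
use warnings;

# Hypothetical helper approximating ':BOMor(<fallback>)': open $filename,
# honour a leading BOM if one is present, otherwise decode with $fallback
# (any encoding name known to Encode).
sub open_bom_or {
    my ($filename, $fallback) = @_;
    open my $fh, '<:raw', $filename or die "Cannot open $filename: $!\n";

    my $head = '';
    my $got  = read $fh, $head, 4;
    defined $got or die "Cannot read $filename: $!\n";

    # 4-byte BOMs are tested before the 2-byte BOMs they begin with.
    # Assumption: FF FE 00 00 means UTF-32LE, not UTF-16LE plus a NUL.
    my ($enc, $skip) =
        $head =~ /\A\x00\x00\xFE\xFF/ ? ('UTF-32BE', 4)
      : $head =~ /\A\xFF\xFE\x00\x00/ ? ('UTF-32LE', 4)
      : $head =~ /\A\xEF\xBB\xBF/     ? ('UTF-8',    3)
      : $head =~ /\A\xFE\xFF/         ? ('UTF-16BE', 2)
      : $head =~ /\A\xFF\xFE/         ? ('UTF-16LE', 2)
      :                                 ($fallback,  0);

    # Rewind to just past the BOM, then decode everything after it.
    seek $fh, $skip, 0 or die "Cannot seek in $filename: $!\n";
    binmode $fh, ":encoding($enc)" or die "Cannot set layer on $filename: $!\n";
    return ($fh, $enc);
}
```

A call such as `my ($fh, $enc) = open_bom_or($filename, 'UTF-8');` returns a handle that yields decoded characters, together with the encoding that was chosen.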

Thanks,
Ilya

r.mar...@fdcx.net

Dec 26, 2011, 9:13:48 AM

Here's what I use and it seems to do what's needed:

use File::BOM qw( :all );

# Open specified input file
open(IFH, '<:via(File::BOM)', $IF)
    or die "Could NOT open specified file ($IF)!\n";

Ben Morrow

Dec 27, 2011, 7:32:43 AM

Quoth Ilya Zakharevich <nospam...@ilyaz.org>:

Encode::Guess, which can be invoked as

open my $fh, '< :encoding(Guess)', $filename

Somewhat annoyingly, you have to explicitly use Encode::Guess or it
won't recognise the encoding name, and you have to use
Encode::Guess->set_suspects to set the list of encodings to try.

Ben

Ilya Zakharevich

Dec 27, 2011, 4:17:31 PM

On 2011-12-26, r.mar...@fdcx.net <r.mar...@fdcx.net> wrote:
>>Does Perl ship with a simple method of opening Unicode files? E.g., I
>>would like to have something like
>>
>> open my $fh, '< :BOM0or(utf8)', $filename
>>
>>where BOM0or does what Perl itself does for Perl files: it looks for the
>>first 4 bytes; given that a Perl file starts in ASCII, one can detect
>>BOMs, can detect UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or see that it
>>is none of the above (then the argument in parens explains what to do;
>>e.g., Perl itself does BOM0or(latin1)).

Thinking about it more, there are 4 situations:

a) we know that the first 2 characters in the file are 7-bit, and
are not 0. Then read the first 2 bytes; if both 0, it is 32BE
(possibly with [hardly legal] BOM); if BOM-BE, it is 16BE+BOM; if
high bits are set, it is UTF-8+BOM; if the first byte is 0, it is
16BE.

One needs to read the other 2 bytes only if 32BE is detected (and
only if one wants to guard against BOM) and if the second byte is
0 - then it may be 16LE or 32LE.

The only possible confusion is whether the file is actually in
Unicode encoding, or in an 8-bit encoding (or between UTF-7 and
UTF-8-no-BOMs).

b) The only thing known is that the first 2 chars are not 0. Again,
one reads 2 bytes - but now there is no way to detect UTF-8-BOM.

c) The only thing known is that the first 2 chars are 7-bit. Then
there is no way to detect BOMless UTF-16.

d) General case: 8-bit chars may be present.

It looks like the decision algorithms are DIFFERENT in these 4 cases;
hence one needs 4 different "filters": One can call them BOM07, BOM08,
BOM7, and BOM8.
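
Situation (a) can be sketched as a small decision function. This is a minimal sketch, not an existing layer: detect_ascii_start is a hypothetical name, it always looks at 4 bytes rather than reading the last 2 only when needed, and it also covers the FF FE (little-endian BOM) prefix, which the description above leaves implicit.

```perl
use strict;
use warnings;

# Hypothetical helper for situation (a): the first two characters of the
# file are known to be 7-bit and non-NUL. Takes the first 4 bytes and
# returns (encoding name, BOM length in bytes). The encoding is undef when
# no Unicode signature is found, so the caller can apply its own fallback.
sub detect_ascii_start {
    my ($head) = @_;
    return (undef, 0) if length($head) < 4;
    my @b = unpack 'C4', $head;

    # BOMs first; 4-byte BOMs before the 2-byte BOMs they begin with.
    # FF FE 00 00 cannot be UTF-16LE here: the first character is non-NUL.
    return ('UTF-32BE', 4) if $head =~ /\A\x00\x00\xFE\xFF/;
    return ('UTF-32LE', 4) if $head =~ /\A\xFF\xFE\x00\x00/;
    return ('UTF-8',    3) if $head =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16BE', 2) if $head =~ /\A\xFE\xFF/;
    return ('UTF-16LE', 2) if $head =~ /\A\xFF\xFE/;

    # No BOM: the position of the NUL bytes decides.
    return ('UTF-32BE', 0) if !$b[0] && !$b[1];    # 00 00 00 c1
    return ('UTF-16BE', 0) if !$b[0];              # 00 c1 00 c2
    if (!$b[1]) {                                  # c1 00 ?? ??
        return ($b[2] || $b[3]) ? ('UTF-16LE', 0)  # c1 00 c2 00
                                : ('UTF-32LE', 0); # c1 00 00 00
    }
    return (undef, 0);    # c1 c2 ...: UTF-8 without BOM or an 8-bit encoding
}
```

The undef result is exactly the ambiguity noted under (a): the bytes alone cannot say whether the file is BOM-less UTF-8, UTF-7, or some 8-bit encoding, so the argument in parens has to decide.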

> Here's what I use and it seems to do what's needed:
>
> use File::BOM qw( :all );

And do you know from which version it is shipped with Perl?

> # Open specified input file
> open(IFH, '<:via(File::BOM)', $IF) or die "Could NOT open specified
> file ($IF)!\n";

Do not see how this may be related: I see no way to inform the filter
about what is known in advance...

Thanks,
Ilya

Ilya Zakharevich

Dec 27, 2011, 4:19:00 PM

On 2011-12-27, Ben Morrow <b...@morrow.me.uk> wrote:
> Encode::Guess, which can be invoked as
>
> open my $fh, '< :encoding(Guess)', $filename
>
> Somewhat annoyingly, you have to explicitly use Encode::Guess or it
> won't recognise the encoding name, and you have to use
> Encode::Guess->set_suspects to set the list of encodings to try.

Same question as to the other answer: does it ship with Perl? And I
do not want any guessing; I want a very deterministic procedure...

Thanks,
Ilya

Ben Morrow

Dec 27, 2011, 5:54:06 PM

Quoth Ilya Zakharevich <nospam...@ilyaz.org>:

> On 2011-12-27, Ben Morrow <b...@morrow.me.uk> wrote:
> > Encode::Guess, which can be invoked as
> >
> > open my $fh, '< :encoding(Guess)', $filename
> >
> > Somewhat annoyingly, you have to explicitly use Encode::Guess or it
> > won't recognise the encoding name, and you have to use
> > Encode::Guess->set_suspects to set the list of encodings to try.
>
> Same question as to the other answer: does it ship with Perl?

~% corelist Encode::Guess

Encode::Guess was first released with perl v5.8.0
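
The same check can be made from inside a script with Module::CoreList, the module behind the corelist command. This is a small sketch, assuming Module::CoreList itself is installed (it is bundled with recent perls); the two module names are simply the ones mentioned in this thread.

```perl
use strict;
use warnings;
use Module::CoreList;

# Report the first perl release that bundled each module, if any.
for my $module (qw(Encode::Guess File::BOM)) {
    my $first = Module::CoreList->first_release($module);
    print defined $first
        ? "$module: first released with perl $first\n"
        : "$module: not listed by Module::CoreList (not a core module)\n";
}
```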

> And I do not want any guessing; I want a very deterministic
> procedure...

Well, given that all UTF-8 files are technically valid ISO8859-* files
as well, some form of heuristic is necessary to distinguish them. E::G
uses a sensible series of tests: first it looks for the various Unicode
BOMs; then it checks for nul bytes and assumes one of the wide Unicode
encodings; then it tries decoding with the list of fallback encodings
you supply and uses the first one that succeeds. (Obviously something
like ISO8859-1 will always succeed, so it would need to be listed on its
own.)
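
As an illustration, the same tests can be run on a buffer that has already been read, using the guess_encoding function that Encode::Guess exports. This is a minimal sketch: decode_guessed is a hypothetical wrapper name, and the suspects passed to it are only examples.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding by default

# Hypothetical wrapper: guess the encoding of a byte string, trying the
# given suspects in addition to the module's defaults, and decode it.
# Returns the decoded text; dies if no single encoding could be chosen.
sub decode_guessed {
    my ($bytes, @suspects) = @_;
    my $enc = guess_encoding($bytes, @suspects);

    # On failure guess_encoding returns an error message, not an object.
    ref $enc or die "Cannot guess encoding: $enc\n";
    return $enc->decode($bytes);
}
```

For example, `decode_guessed($bytes, 'iso-8859-1')` tries ISO-8859-1 in addition to the defaults; when more than one suspect decodes the data cleanly, the guess is reported as ambiguous and this wrapper dies.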

Ben

r.mar...@fdcx.net

Dec 27, 2011, 10:59:59 PM

Do as all perl mongers do - use CPAN to locate, download and install
the needed function.

$>perl -MCPAN -e shell

Similar source available with ActiveState for Windows

Ilya Zakharevich

Dec 31, 2011, 4:52:43 PM

On 2011-12-28, r.mar...@fdcx.net <r.mar...@fdcx.net> wrote:
>>Same question as to the other answer: does it ship with Perl? And I
>>do not want any guessing; I want a very deterministic procedure...

> Do as all perl mongers do - use CPAN to locate, download and install
> the needed function.
>
> $>perl -MCPAN -e shell

I never do "as all perl mongers do". Neither, I expect, do users of
my code.

Hope this helps,
Ilya

Tim McDaniel

Jan 1, 2012, 11:33:11 PM

In article <7v4lf7lr7g2ro357f...@4ax.com>,
<r.mar...@fdcx.net> wrote:
>On Tue, 27 Dec 2011 21:19:00 +0000 (UTC), Ilya Zakharevich
><nospam...@ilyaz.org> wrote:
>>Same question as to the other answer: does it ship with Perl? And I
>>do not want any guessing; I want a very deterministic procedure...
>
>Do as all perl mongers do - use CPAN to locate, download and install
>the needed function.
>
>$>perl -MCPAN -e shell

I was a maintainer of servers at previous jobs and could do that for
the system. But not at my current job, and if I wanted to do it for a
shared script, I don't know yet how receptive they would be to a
request. It's why I "use constant" instead of a more modern and
convenient module.

--
Tim McDaniel, tm...@panix.com

tch...@perl.com

Feb 15, 2012, 4:02:57 PM

Ilya,

I understand completely. I find that Encode::Guess is too unreliable for
my purposes. I have a replacement version that is built on a statistical
model derived from very large English-language corpora; it gets the
encoding right 99.79% of the time, including on conflicting 8-bit
encodings. For example, it knows CP1252 from MacRoman from ISO-8859-1
from ISO-8859-15, etc. I have a working alpha version of the code, so if
you are interested in this technique or wish to know more, please send me
mail. You can fetch the alpha version from

http://training.perl.com/scripts/Encode-Guess-Educated-0.03.tar.gz

I'm having trouble with my PAUSE id, so it isn't on CPAN yet.

Hope this helps, and do feel free to write. I never look here for anything,
so am likely to miss a reply.

--tom