On 2011-12-26, r.mar...@fdcx.net <r.mar...@fdcx.net> wrote:
>>Does Perl ship with a simple method of opening Unicode files? E.g., I
>>would like to have something like
>>
>> open my $fh, '< :BOM0or(utf8)', $filename
>>
>>where BOM0or does what Perl itself does for Perl files: it looks for the
>>first 4 bytes; given that a Perl file starts in ASCII, one can detect
>>BOMs, can detect UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or see that it
>>is none of the above (then the argument in parens explains what to do;
>>e.g., Perl itself does BOM0or(latin1)).
Thinking about it more, there are 4 situations:
a) we know that the first 2 characters in the file are 7-bit, and
are not 0. Then read the first 2 bytes; if both 0, it is 32BE
(possibly with [hardly legal] BOM); if BOM-BE, it is 16BE+BOM; if
high bits are set, it is UTF-8+BOM; if the first byte is 0, it is
16BE.
One needs to read the other 2 bytes only if 32BE is detected (and
only if one wants to guard against a BOM), or if the second byte is
0 - then it may be 16LE or 32LE.
The only possible confusion is whether the file is actually in
Unicode encoding, or in an 8-bit encoding (or between UTF-7 and
UTF-8-no-BOMs).
b) The only thing known is that the first 2 chars are not 0. Again,
one reads 2 bytes - but now there is no way to detect UTF-8-BOM.
c) The only thing known is that the first 2 chars are 7-bit. Then
there is no way to detect BOMless UTF-16.
d) General case: 8-bit chars may be present.
It looks like the decision algorithms are DIFFERENT in these 4 cases;
hence one needs 4 different "filters": One can call them BOM07, BOM08,
BOM7, and BOM8.
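
For illustration only, the decision for case (a) could be sketched in
Perl roughly as below. This is not an existing layer or module: the
names guess_encoding_a and open_guessed_a are hypothetical, and the
handling of a little-endian BOM (FF FE) and the full 3-byte check for
the UTF-8 BOM are my assumptions about how the gaps would be filled.
The sketch assumes the file holds at least 2 characters, both 7-bit
and not 0.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding from the first 4 bytes,
# ASSUMING case (a): the first 2 characters are 7-bit and not NUL.
# Returns (encoding name, number of BOM bytes to skip).
sub guess_encoding_a {
    my ($head, $default) = @_;
    # Fewer than 4 bytes cannot hold 2 characters in UTF-16/32.
    return ($default, 0) if length($head) < 4;
    my ($b0, $b1, $b2, $b3) = unpack 'C4', $head;

    if ($b0 == 0 && $b1 == 0) {            # 32BE, possibly with BOM
        return $b2 == 0xFE && $b3 == 0xFF ? ('UTF-32BE', 4)
                                          : ('UTF-32BE', 0);
    }
    return ('UTF-16BE', 2) if $b0 == 0xFE && $b1 == 0xFF;
    if ($b0 == 0xFF && $b1 == 0xFE) {      # LE BOM (assumed handling)
        return $b2 == 0 && $b3 == 0 ? ('UTF-32LE', 4) : ('UTF-16LE', 2);
    }
    return ('UTF-8', 3) if $b0 == 0xEF && $b1 == 0xBB && $b2 == 0xBF;
    return ('UTF-16BE', 0) if $b0 == 0;    # 00 xx: BOMless 16BE
    if ($b1 == 0) {                        # xx 00: 16LE or 32LE
        return $b2 == 0 && $b3 == 0 ? ('UTF-32LE', 0) : ('UTF-16LE', 0);
    }
    return ($default, 0);                  # 8-bit or BOMless UTF-8
}

# Hypothetical wrapper: open raw, sniff, skip the BOM, push the layer.
sub open_guessed_a {
    my ($filename, $default) = @_;
    open my $fh, '<:raw', $filename or die "Cannot open $filename: $!";
    read $fh, my $head, 4;
    my ($enc, $skip) = guess_encoding_a($head // '', $default);
    seek $fh, $skip, 0 or die "Cannot seek in $filename: $!";
    binmode $fh, ":encoding($enc)";
    return $fh;
}

my ($enc, $skip) = guess_encoding_a("\xFF\xFEa\0", 'latin1');
print "$enc, BOM bytes: $skip\n";
```

The other three cases would need their own variants of the byte tests,
which is the point of having separate filters.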
> Here's what I use and it seems to do what's needed:
>
> use File::BOM qw( :all );
And do you know from which version it is shipped with Perl?
> # Open specified input file
> open(IFH, '<:via(File::BOM)', $IF) or die "Could NOT open specified file ($IF)!\n";
Do not see how this may be related: I see no way to inform the filter
about what is known in advance...
Thanks,
Ilya