use strict;
open F, '>:encoding(UTF-16LE)', "slask2.txt";
print F "1\n2\n3\n";
close F;
When I open the output in a hex editor I see
31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00
I would expect to see:
31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00
That is, I expect \n to be translated to 0D 00 0A 00, now it is translated
to three bytes.
It seems like I'm missing something basic, but the information is spread out
on several man-pages, and I have not been able to find where my error lies.
perl -v:
This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x86-multi-thread
(with 8 registered patches, see perl -V for more detail)
Copyright 1987-2010, Larry Wall
Binary build 1202 [293621] provided by ActiveState
http://www.ActiveState.com
Built Sep 6 2010 23:36:03
--
Erland Sommarskog, Stockholm, esq...@sommarskog.se
In other words (od -c):
1 \0 \r \n \0 2 \0 \r \n \0 3 \0 \r \n \0
> I would expect to see:
>
> 31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00
Guess you would even expect:
… 33 00 0D 00 0A 00
> That is, I expect \n to be translated to 0D 00 0A 00, now it is
> translated to three bytes.
It looks like a bug to me. I'm getting the same result as you for:
* ActivePerl 5.10.1
* ActivePerl 5.12.1
* Strawberry 5.12.0
All three participants show correspondingly wrong results for UTF-16BE.
And also for UTF-16, which just adds the BOM.
Perl/Cygwin 5.10.1 does fine because its OS is "cygwin", so it doesn't
translate "\n" to CRLF.
--
Michael Ludwig
You need to stack the I/O layers in the right order. The :encoding() layer
needs to come last (be at the bottom of the stack), *after* the :crlf layer
adds the additional carriage returns. The way to pop the default :crlf
layer is to start out with the :raw pseudo-layer:
open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
Cheers,
-Jan
> You need to stack the I/O layers in the right order. The :encoding()
> layer needs to come last (be at the bottom of the stack), *after* the
> :crlf layer adds the additional carriage returns. The way to pop the
> default :crlf layer is to start out with the :raw pseudo-layer:
>
> open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
Cool, that works. thanks! :-)
--
Michael Ludwig
Certainly not anywhere close to intuitive. And the explanation is even more
muddy. "Needs to come last" - it is smack in the middle. "after the :crlf
layer" - it comes before.
One can sense some potential for improvements. Not the least in the
documentation area.
The explanation makes sense; so much so that I overlooked the fact that
this is simply not how it works. Luckily, you were being vigilant. :-)
What I can imagine is that handling the logical entity \n is some sort
of post-processing step, which would explain why it needs to come last.
Here's a short demo script to show various layer combinations and how
they go wrong:
\,,,/
(o o)
------oOOo-(_)-oOOo------
use strict;
my $str = "1\n2\n3\n"; # string to print
my $fno = 1; # counter for filenames
sub out {
my $fn = sprintf 'u%02u-%s.txt', $fno++, (join '-', @_) || 'NONE';
my $layers = join '', map ":$_", @_;
printf STDERR "%30s => %-40s\n", $layers, $fn;
open my $fh, ">$layers", $fn or die "open $fn: $!";
print $fh $str;
close $fh;
}
my $e = 'encoding(UTF-16LE)';
my $r = 'raw';
my $n = 'crlf';
out; # default layers
out $r; # reset default layers
out $r, $n; # same as default on Windows
out $n, $r; # :raw at the end resets *all* layers
out $e, $r; # ditto
out $n, $e, $r; # ditto
out $e, $n, $r; # ditto
out $r, $e, $n; # appears illogical, but correct result
out $r, $n, $e; # appears logical, but wrong result
out $e, $n;
out $n, $e;
out $n, $r, $e; # :crlf reset
--
Michael Ludwig
Would you mind explaining how it is *not* working the way I
described it above? I realize that the fact that layers work
as a "stack" may be confusing, which is why I annotated "last"
with "bottom of the stack". Of course the one last on the stack
is the first in the list of layers passed to open() because stacks
are LIFO (last in/first out):
:raw                - clears the existing :crlf layer from the stack
                      could have used :pop instead, but :raw is more robust

:encoding(UTF-16LE) - pushes the :encoding layer to the stack. This makes
                      it the last layer on the stack (and also still the
                      first, for now).

:crlf               - pushes the :crlf layer on the stack. :encoding is
                      still the last layer, but :crlf is now the first.

Now when you print a string to the filehandle, then it will be passed
to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
string, and then passes it on to the next lower layer :encoding, which
will do the encoding, and when it reaches the bottom of the stack the
data is actually written to the filesystem.
Files opened on Windows already have the :crlf layer pushed by default,
so you somehow need to get the :encoding layer *below* it. If
you have it on top, then the crlf substitution happens *after* the
encoding, leading to incorrect data.
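
To make the ordering concrete, here is a minimal sketch that imitates the two orders by hand with the core Encode module instead of real PerlIO layers. The s/\n/\r\n/g substitution is only a stand-in for what :crlf does, and hex_dump is a helper made up for this illustration, so treat it as a model of the explanation rather than as what the interpreter does internally:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "1\n2\n";

# hex_dump() is a hypothetical helper for this sketch: bytes as "31 00 ..."
sub hex_dump { join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

# Order 1: CRLF translation on characters first, then encoding.
# This models ">:raw:encoding(UTF-16LE):crlf" (:crlf on top of :encoding).
(my $crlf_first = $text) =~ s/\n/\r\n/g;
my $good = encode('UTF-16LE', $crlf_first);

# Order 2: encoding first, then CRLF translation on the encoded bytes.
# This models :encoding pushed on top of the default :crlf layer.
my $bad = encode('UTF-16LE', $text);
$bad =~ s/\n/\r\n/g;

print hex_dump($good), "\n";   # 31 00 0D 00 0A 00 32 00 0D 00 0A 00
print hex_dump($bad),  "\n";   # 31 00 0D 0A 00 32 00 0D 0A 00
```

The second line reproduces the three-byte 0D 0A 00 line ending from the original report: the byte-level substitution puts a single 0D in front of the 0A that is already part of the two-byte UTF-16LE newline.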
Cheers,
-Jan
This is open source. Patches welcome! This is how things get better.
Cheers,
-Jan
Sorry - it works exactly the way you described above. I didn't read
properly. I got confused by the uniform look of real and pseudo layers.
The :raw pseudo layer is not a layer, but rather, as you write, an
instruction to clear the stack, like this:
:raw                -> clear()
:encoding(UTF-16LE) -> push( encoding(UTF-16LE) )
:crlf               -> push( crlf )
I was *wrongly* thinking this, as if :raw were another layer, and not a
clearing instruction:
:raw                -> push( raw ) # wrong!
:encoding(UTF-16LE) -> push( encoding(UTF-16LE) )
:crlf               -> push( crlf )
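
The clear/push picture can be played through with a toy model. This is only a sketch of the description in this thread, not of the real PerlIO code: apply_spec is a made-up helper, :raw is treated as "clear everything", and the starting stack is assumed to hold nothing but the default :crlf layer found on Windows.

```perl
use strict;
use warnings;

# Toy model of a layer stack: the last array element is the top of the stack.
# apply_spec() is a hypothetical helper, not part of PerlIO.
sub apply_spec {
    my ($spec, @stack) = @_;
    for my $layer ($spec =~ /:([^:]+)/g) {
        if    ($layer eq 'raw') { @stack = () }          # pseudo-layer: clear the stack
        elsif ($layer eq 'pop') { pop @stack }           # pseudo-layer: drop the top layer
        else                    { push @stack, $layer }  # real layer: push on top
    }
    return @stack;
}

# Assumed starting point on Windows: only the default :crlf layer.
my @default = ('crlf');

for my $spec (':encoding(UTF-16LE)', ':raw:encoding(UTF-16LE):crlf') {
    my @stack = apply_spec($spec, @default);
    # print() hands data to the top layer first, so reverse for the write order
    printf "%-30s write order: %s\n", $spec, join(' -> ', reverse @stack);
}
```

For the first spec the model gives the write order encoding(UTF-16LE) -> crlf (the broken case), for the second crlf -> encoding(UTF-16LE) (the working one).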
Regarding your explanation:
> I realize that the fact that layers work as a "stack" may be
> confusing, which is why I annotated "last" with "bottom of the stack".
> Of course the one last on the stack is the first in the list of layers
> passed to open() because stacks are LIFO (last in/first out):
>
> :raw                - clears the existing :crlf layer from the stack
>                       could have used :pop instead, but :raw is more robust
>
> :encoding(UTF-16LE) - pushes the :encoding layer to the stack. This makes
>                       it the last layer on the stack (and also still the
>                       first, for now).
>
> :crlf               - pushes the :crlf layer on the stack. :encoding is
>                       still the last layer, but :crlf is now the first.
>
> Now when you print a string to the filehandle, then it will be passed
> to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
> string, and then passes it on to the next lower layer :encoding, which
> will do the encoding, and when it reaches the bottom of the stack the
> data is actually written to the filesystem.
>
> Files opened on Windows already have the :crlf layer pushed by default,
> so you somehow need to get the :encoding layer *below* it. If
> you have it on top, then the crlf substitution happens *after* the
> encoding, leading to incorrect data.
I think you've clarified it for all eternity.
What would be the best place to add your explanation to the docs?
http://perldoc.perl.org/functions/binmode.html
http://perldoc.perl.org/functions/open.html
http://perldoc.perl.org/perlunicode.html
http://perldoc.perl.org/PerlIO.html
Judging from existing content, I think PerlIO would be a good place for
this addition. It already has a lot of great information. However, it
starts in medias res instead of first providing an overview and
introducing the stack picture. This could be improved.
On the downside, it is buried in the Modules Section. And the title [1]
is just too technical and might scare novice readers away.
Can you think of a better place for your user-friendly doc addition? You
obviously know the docs far better than I do … :-)
[1] PerlIO - On demand loader for PerlIO layers
and root of PerlIO::* name space
--
Michael Ludwig
> Files opened on Windows already have the :crlf layer pushed by default,
> so you somehow need to get the :encoding layer *below* it.
>
> open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
0D 00 0A
But the way you describe it, I would expect it to be
0D 0A 00
That is, first the string is encoded in UTF-16LE and the newline gets
expanded from 0A to 0A 00.
Next, the crlf layer jumps in and blindly adds a carriage return, but
somehow it does manage to get the \r character correct nevertheless, but
loses the high byte of the \n.
I went back to the very first message in the thread, where you write:
| When I open the output in a hex editor I see
|
| 31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00
|
| I would expect to see:
|
| 31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00
|
| That is, I expect \n to be translated to 0D 00 0A 00, now it is translated
| to three bytes.
( from http://code.activestate.com/lists/perl-unicode/3256/ )
So it looks like what you saw is exactly what you expect to see
based on my explanation. :)
I couldn't find any example where you had "\r\0\n" as a line ending.
Cheers,
-Jan
Depends on what you mean by “correctly”. It does work correctly as-is, creating output encoded as UTF-16LE with CR/LF line endings. If you want different layers on different operating systems, then you will need to tell the interpreter what exactly it is you want. Which means you probably have to look at $^O. Assuming you don’t want the :crlf layer on non-Windows systems it is as easy as:
open(my $fh, ">:raw:encoding(UTF-16LE)", $filename) or die $!;
binmode($fh, ":crlf") if $^O eq "MSWin32";
Cheers,
-Jan
PS: I saw some discussion today that the :raw pseudo-layer in the open() call will also remove the buffering layer (it doesn’t do that when you use it in a binmode() call). I’ll try to remember to send a followup once I actually understand what is going on.
That seems indeed to be the case right now. The bug is filed here:
http://rt.perl.org/rt3//Public/Bug/Display.html?id=80764
A workaround is to use ":raw:perlio" instead of ":raw" to turn to
binmode without losing the buffering.
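
One way to see which layers actually ended up on a handle is the built-in PerlIO::get_layers(). The following is a small sketch that assumes a scratch file created with File::Temp; what it prints depends on the platform and the Perl version, so no output is shown here:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the experiment; File::Temp picks the name.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# ":raw:perlio" instead of plain ":raw", as suggested above, followed by
# the encoding and crlf layers from earlier in the thread.
open(my $fh, '>:raw:perlio:encoding(UTF-16LE):crlf', $filename)
    or die "open $filename: $!";

# List the layers from the bottom of the stack to the top.
my @layers = PerlIO::get_layers($fh);
print join(' ', @layers), "\n";

close $fh;
```

Running the same script with plain ":raw" in place of ":raw:perlio" should show whether the buffering layer is missing on a given build.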
Cheers,
-Jan
use strict;
open F, '>slask.out';
binmode(F, ':raw:encoding(UTF16-LE):crlf');
print F "Alfa\nBeta\nGamma\n";
Looking at the file in a binary editor, I see:
41 00 6C 00 66 00 61 00 0D 0A 00 42 00 65 00 74
00 61 00 0D 0A 00 47 00 61 00 6D 00 6D 00 61 00
0D 0A 00
In total 35 bytes. Which is a very odd number for a UTF16 file.
I've double-checked with Leon, who thinks that this is due to bug 38456:
http://rt.perl.org/rt3//Public/Bug/Display.html?id=38456
He made a patch to fix the bug, and the patch has been applied to
bleadperl already. I ran your sample script with 5.13.9 plus his
patch, and it generates a correct 38 bytes file. I'm not sure
if this change could/should be picked for a 5.12.4 release as
well, but I guess it probably won't. But 5.14 should be out
in April or May anyways...
It looks like there is still a lot of brokenness lurking in
the internals of the Perl I/O layer implementation. :(
Cheers,
-Jan
Yes, there certainly seems to be some more stuff to do in the Unicode
support in Perl. For instance, support for Unicode filenames in open or
opendir.
> Yes, there certainly seems to be some more stuff to do in the Unicode
> support in Perl. For instance, support for Unicode filenames in open
> or opendir.
I think there is no portable answer here, as it depends on the
filesystem's support for Unicode.
Or what exactly are you referring to?
--
Michael Ludwig
Obviously, Unicode cannot happen on systems which do not support Unicode.
For instance, I use Windows exclusively, so Unicode in file names is no
problem. On the other hand, it's a dead case for system() and backticks
as far as I can make out. (That is, I have not been able to run Unicode
BAT files.)
No … but Perl did. :-)
> For instance, I use Windows exclusively, so Unicode in file names is
> no problem.
Did a quick test:
\,,,/
(o o)
------oOOo-(_)-oOOo------
use strict;
use warnings;
use utf8;
my $fn = 'a…b.txt'; # with a Unicode character
open my $fh, '>:encoding(UTF-8)', $fn or die "open $fn: $!";
print $fh "$fn\n";
close $fh;
-------------------------
v5.10.1 (*) built for i686-cygwin-thread-multi-64int
* a…b.txt
* correct (in Explorer, cmd.exe, MinTTY)
* has: CYG17 utf8-paths (which might be responsible)
(v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)
* a…b.txt
* not correct
* doesn't have anything with "uni" or "utf" in "perl -V"
--
Michael Ludwig
Not that it is a terribly big deal. In the program where I want to
support Unicode names, I've already written a module around Win32API::File,
which lets me open a file in Windows and then associate it with
a file handle.