encoding(UTF16-LE) on Windows

Erland Sommarskog

Jan 17, 2011, 8:57:43 AM
to perl-u...@perl.org
I'm on Windows and I have this small script:

use strict;
open F, '>:encoding(UTF-16LE)', "slask2.txt";
print F "1\n2\n3\n";
close F;

When I open the output in a hex editor I see

31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00

I would expect to see:

31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00

That is, I expect \n to be translated to 0D 00 0A 00, now it is translated
to three bytes.

It seems like I'm missing something basic, but the information is spread out
on several man-pages, and I have not been able to find where my error lies.

perl -v:

This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x86-multi-thread
(with 8 registered patches, see perl -V for more detail)

Copyright 1987-2010, Larry Wall

Binary build 1202 [293621] provided by ActiveState
http://www.ActiveState.com
Built Sep 6 2010 23:36:03

--
Erland Sommarskog, Stockholm, esq...@sommarskog.se

Michael Ludwig

Jan 19, 2011, 5:11:41 AM
to perl-u...@perl.org
Erland Sommarskog wrote on 17.01.2011 at 13:57 (-0000):
> I'm on Windows and I have this small script:
>
> use strict;
> open F, '>:encoding(UTF-16LE)', "slask2.txt";
> print F "1\n2\n3\n";
> close F;
>
> When I open the output in a hex editor I see
>
> 31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00

In other words (od -c):

1 \0 \r \n \0 2 \0 \r \n \0 3 \0 \r \n \0

> I would expect to see:
>
> 31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00

Guess you would even expect:

… 33 00 0D 00 0A 00

> That is, I expect \n to be translated to 0D 00 0A 00, now it is
> translated to three bytes.

It looks like a bug to me. I'm getting the same result as you for:

* ActivePerl 5.10.1
* ActivePerl 5.12.1
* Strawberry 5.12.0

All three participants show correspondingly wrong results for UTF-16BE.
And also for UTF-16, which just adds the BOM.

Perl/Cygwin 5.10.1 does fine because its OS is "cygwin", so it doesn't
translate "\n" to CRLF.

--
Michael Ludwig

Jan Dubois

Jan 19, 2011, 2:08:30 PM
to Michael Ludwig, perl-u...@perl.org
On Wed, 19 Jan 2011, Michael Ludwig wrote:
> Erland Sommarskog wrote on 17.01.2011 at 13:57 (-0000):
> > I'm on Windows and I have this small script:
> >
> > use strict;
> > open F, '>:encoding(UTF-16LE)', "slask2.txt";
> > print F "1\n2\n3\n";
> > close F;
> >
> > When I open the output in a hex editor I see
> >
> > 31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00
>
>
> It looks like a bug to me. I'm getting the same result as you for:
>
> * ActivePerl 5.10.1
> * ActivePerl 5.12.1
> * Strawberry 5.12.0
>
> All three participants show correspondingly wrong results for UTF-16BE.
> And also for UTF-16, which just adds the BOM.
>
> Perl/Cygwin 5.10.1 does fine because its OS is "cygwin", so it doesn't
> translate "\n" to CRLF.

You need to stack the I/O layers in the right order. The :encoding() layer
needs to come last (be at the bottom of the stack), *after* the :crlf layer
adds the additional carriage returns. The way to pop the default :crlf
layer is to start out with the :raw pseudo-layer:

open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
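
A minimal sketch of the layer order suggested above, written so that it can be run on any platform: it writes the three-line test string through '>:raw:encoding(UTF-16LE):crlf' into a temporary file and hex-dumps the raw bytes. The helper names write_utf16le_crlf and hex_bytes, and the use of File::Temp, are assumptions made for illustration and are not part of the original message.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $text through the layer stack from the message above and
# return the raw bytes that ended up in the file.
sub write_utf16le_crlf {
    my ($text) = @_;
    my ($tmp, $filename) = tempfile(UNLINK => 1);
    close $tmp;

    open(my $out, '>:raw:encoding(UTF-16LE):crlf', $filename)
        or die "open $filename: $!";
    print {$out} $text;
    close $out or die "close $filename: $!";

    open(my $in, '<:raw', $filename) or die "open $filename: $!";
    local $/;                      # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Format a byte string as space-separated upper-case hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return uc(join(' ', unpack('(H2)*', $bytes)));
}

print hex_bytes(write_utf16le_crlf("1\n2\n3\n")), "\n";
```

If the layers behave as described in this thread, every line ending should come out as the four bytes 0D 00 0A 00 and the file length should be even.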

Cheers,
-Jan

Michael Ludwig

Jan 19, 2011, 2:45:27 PM
to perl-u...@perl.org
Jan Dubois wrote on 19.01.2011 at 11:08 (-0800):

> You need to stack the I/O layers in the right order. The :encoding()
> layer needs to come last (be at the bottom of the stack), *after* the
> :crlf layer adds the additional carriage returns. The way to pop the
> default :crlf layer is to start out with the :raw pseudo-layer:
>
> open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

Cool, that works. Thanks! :-)

--
Michael Ludwig

Erland Sommarskog

Jan 20, 2011, 3:29:05 AM
to perl-u...@perl.org
"Jan Dubois" (ja...@activestate.com) writes:
> You need to stack the I/O layers in the right order. The :encoding()
> layer needs to come last (be at the bottom of the stack), *after* the
> :crlf layer adds the additional carriage returns. The way to pop the
> default :crlf layer is to start out with the :raw pseudo-layer:
>
> open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

Certainly not anywhere close to intuitive. And the explanation is even more
muddy. "Needs to come last" - it is smack in the middle. "after the :crlf
layer" - it comes before.

One can sense some potential for improvements. Not the least in the
documentation area.

Michael Ludwig

Jan 20, 2011, 9:52:38 AM
to perl-u...@perl.org
Erland Sommarskog wrote on 20.01.2011 at 08:29 (-0000):
> "Jan Dubois" (ja...@activestate.com) writes:
> > You need to stack the I/O layers in the right order. The :encoding()
> > layer needs to come last (be at the bottom of the stack), *after* the
> > :crlf layer adds the additional carriage returns. The way to pop the
> > default :crlf layer is to start out with the :raw pseudo-layer:
> >
> > open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
>
> Certainly not anywhere close to intuitive. And the explanation is even
> more muddy. "Needs to come last" - it is smack in the middle. "after
> the :crlf layer" - it comes before.

The explanation makes sense; so much so that I overlooked the fact that
this is simply not how it works. Luckily, you were being vigilant. :-)

What I can imagine is that handling the logical entity \n is some sort
of post-processing step, which would explain why it needs to come last.

Here's a short demo script to show various layer combinations and how
they go wrong:

\,,,/
(o o)
------oOOo-(_)-oOOo------
use strict;

my $str = "1\n2\n3\n"; # string to print
my $fno = 1; # counter for filenames

sub out {
    my $fn = sprintf 'u%02u-%s.txt', $fno++, (join '-', @_) || 'NONE';
    my $layers = join '', map ":$_", @_;
    printf STDERR "%30s => %-40s\n", $layers, $fn;
    open my $fh, ">$layers", $fn or die "open $fn: $!";
    print $fh $str;
    close $fh;
}

my $e = 'encoding(UTF-16LE)';
my $r = 'raw';
my $n = 'crlf';

out; # default layers
out $r; # reset default layers
out $r, $n; # same as default on Windows
out $n, $r; # :raw at the end resets *all* layers
out $e, $r; # ditto
out $n, $e, $r; # ditto
out $e, $n, $r; # ditto
out $r, $e, $n; # appears illogical, but correct result
out $r, $n, $e; # appears logical, but wrong result
out $e, $n;
out $n, $e;
out $n, $r, $e; # :crlf reset

--
Michael Ludwig

Jan Dubois

Jan 20, 2011, 3:45:30 PM
to Michael Ludwig, perl-u...@perl.org
On Thu, 20 Jan 2011, Michael Ludwig wrote:
> Erland Sommarskog wrote on 20.01.2011 at 08:29 (-0000):
> > "Jan Dubois" (ja...@activestate.com) writes:
> > > You need to stack the I/O layers in the right order. The :encoding()
> > > layer needs to come last (be at the bottom of the stack), *after* the
> > > :crlf layer adds the additional carriage returns. The way to pop the
> > > default :crlf layer is to start out with the :raw pseudo-layer:
> > >
> > > open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
> >
> > Certainly not anywhere close to intuitive. And the explanation is even
> > more muddy. "Needs to come last" - it is smack in the middle. "after
> > the :crlf layer" - it comes before.
>
> The explanation makes sense; so much so that I overlooked the fact that
> this is simply not how it works. Luckily, you were being vigilant. :-)

Would you mind explaining how it is *not* working the way I
described it above? I realize that the fact that layers work
as a "stack" may be confusing, which is why I annotated "last"
with "bottom of the stack". Of course the one last on the stack
is the first in the list of layers passed to open() because stacks
are LIFO (last in/first out):

:raw - clears the existing :crlf layer from the stack
could have used :pop instead, but :raw is more robust

:encoding(UTF-16LE) - pushes the :encoding layer to the stack. This makes
it the last layer on the stack (and also still the
first, for now).

:crlf - pushes the :crlf layer on the stack. :encoding is
still the last layer, but :crlf is now the first.

Now when you print a string to the filehandle, then it will be passed
to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
string, and then passes it on to the next lower layer :encoding, which
will do the encoding, and when it reaches the bottom of the stack the
data is actually written to the filesystem.

Files opened on Windows already have the :crlf layer pushed by default,
so you somehow need to get the :encoding layer *below* it. If
you have it on top, then the crlf substitution happens *after* the
encoding, leading to incorrect data.
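
The effect of the two stacking orders can be sketched without any file I/O. The example below only models the layers with plain string operations (an assumption made for illustration; it is not how PerlIO is implemented), and the sub names crlf_then_encode and encode_then_crlf are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Space-separated upper-case hex dump of a byte string.
sub hex_bytes {
    my ($bytes) = @_;
    return uc(join(' ', unpack('(H2)*', $bytes)));
}

# :crlf on top of :encoding: newlines are expanded on characters first,
# then the whole string is encoded.
sub crlf_then_encode {
    my ($text) = @_;
    (my $expanded = $text) =~ s/\n/\r\n/g;
    return encode('UTF-16LE', $expanded);
}

# :encoding on top of :crlf: the string is encoded first, then a single
# CR byte is inserted before every LF byte of the encoded data.
sub encode_then_crlf {
    my ($text) = @_;
    my $bytes = encode('UTF-16LE', $text);
    $bytes =~ s/\n/\r\n/g;
    return $bytes;
}

print hex_bytes(crlf_then_encode("1\n2\n")), "\n";   # 31 00 0D 00 0A 00 32 00 0D 00 0A 00
print hex_bytes(encode_then_crlf("1\n2\n")), "\n";   # 31 00 0D 0A 00 32 00 0D 0A 00
```

The second line reproduces the three-byte line endings (0D 0A 00) reported at the start of the thread.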

Cheers,
-Jan


Jan Dubois

Jan 20, 2011, 3:46:48 PM
to Erland Sommarskog, perl-u...@perl.org
On Thu, 20 Jan 2011, Erland Sommarskog wrote:
> One can sense some potential for improvements. Not the least in the
> documentation area.

This is open source. Patches welcome! This is how things get better.

Cheers,
-Jan

Michael Ludwig

Jan 20, 2011, 5:10:56 PM
to perl-u...@perl.org
[RE: encoding(UTF16-LE) on Windows]

Jan Dubois wrote on 20.01.2011 at 12:45 (-0800):
> On Thu, 20 Jan 2011, Michael Ludwig wrote:
> > Erland Sommarskog wrote on 20.01.2011 at 08:29 (-0000):
> > > "Jan Dubois" (ja...@activestate.com) writes:
> > > > You need to stack the I/O layers in the right order. The :encoding()
> > > > layer needs to come last (be at the bottom of the stack), *after* the
> > > > :crlf layer adds the additional carriage returns. The way to pop the
> > > > default :crlf layer is to start out with the :raw pseudo-layer:
> > > >
> > > > open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
> > >
> > > Certainly not anywhere close to intuitive. And the explanation is even
> > > more muddy. "Needs to come last" - it is smack in the middle. "after
> > > the :crlf layer" - it comes before.
> >
> > The explanation makes sense; so much so that I overlooked the fact that
> > this is simply not how it works. Luckily, you were being vigilant. :-)
>
> Would you mind explaining how it is *not* working the way I
> described it above?

Sorry - it works exactly the way you described above. I didn't read
properly. I got confused by the uniform look of real and pseudo layers.
The :raw pseudo layer is not a layer, but rather, as you write, an
instruction to clear the stack, like this:

:raw -> clear()
:encoding(UTF-16LE) -> push( encoding(UTF-16LE) )
:crlf -> push( crlf )

I was *wrongly* thinking this, as if :raw were another layer, and not a
clearing instruction:

:raw -> push( raw ) # wrong!
:encoding(UTF-16LE) -> push( encoding(UTF-16LE) )
:crlf -> push( crlf )
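
The clear/push picture can be checked on a live handle with the built-in PerlIO::get_layers, which lists the layers from the bottom of the stack upwards. This is a sketch only: the helper name layers_for and the use of File::Temp are assumptions, and the exact list printed depends on the platform and the Perl version.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Open a scratch file for writing with the given layer string and return
# the names of the layers on the handle, bottom of the stack first.
sub layers_for {
    my ($layer_spec) = @_;
    my ($tmp, $filename) = tempfile(UNLINK => 1);
    close $tmp;
    open(my $fh, ">$layer_spec", $filename) or die "open $filename: $!";
    my @layers = PerlIO::get_layers($fh);
    close $fh;
    return @layers;
}

print join(' ', layers_for(':raw:encoding(UTF-16LE):crlf')), "\n";
```

No entry called "raw" should appear in the list, and crlf should be listed after (that is, above) the encoding layer.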

Regarding your explanation:

> I realize that the fact that layers work as a "stack" may be
> confusing, which is why I annotated "last" with "bottom of the stack".
> Of course the one last on the stack is the first in the list of layers
> passed to open() because stacks are LIFO (last in/first out):
>
> :raw - clears the existing :crlf layer from the stack
> could have used :pop instead, but :raw is more robust
>
> :encoding(UTF-16LE) - pushes the :encoding layer to the stack. This makes
> it the last layer on the stack (and also still the
> first, for now).
>
> :crlf - pushes the :crlf layer on the stack. :encoding is
> still the last layer, but :crlf is now the first.
>
> Now when you print a string to the filehandle, then it will be passed
> to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
> string, and then passes it on to the next lower layer :encoding, which
> will do the encoding, and when it reaches the bottom of the stack the
> data is actually written to the filesystem.
>
> Files opened on Windows already have the :crlf layer pushed by default,
> so you somehow need to get the :encoding layer *below* it. If
> you have it on top, then the crlf substitution happens *after* the
> encoding, leading to incorrect data.

I think you've clarified it for all eternity.

What would be the best place to add your explanation to the docs?

http://perldoc.perl.org/functions/binmode.html
http://perldoc.perl.org/functions/open.html
http://perldoc.perl.org/perlunicode.html
http://perldoc.perl.org/PerlIO.html

Judging from existing content, I think PerlIO would be a good place for
this addition. It already has a lot of great information. However, it
starts in medias res instead of first providing an overview and
introducing the stack picture. This could be improved.

On the downside, it is buried in the Modules Section. And the title [1]
is just too technical and might scare novice readers away.

Can you think of a better place for your user-friendly doc addition? You
obviously know the docs far better than I do … :-)

[1] PerlIO - On demand loader for PerlIO layers
and root of PerlIO::* name space

--
Michael Ludwig

Bob Hallissy

Jan 20, 2011, 10:40:09 PM
to perl-u...@perl.org

Jan Dubois wrote:
> Files opened on Windows already have the :crlf layer pushed by default,
> so you somehow need to get the :encoding layer *below* it.

Is it possible to re-write the working statement
  open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
in a way that works correctly on any platform (without referring to $^O) ?

Bob

Erland Sommarskog

Jan 21, 2011, 3:38:41 AM
to perl-u...@perl.org
"Jan Dubois" (ja...@activestate.com) writes:
> Now when you print a string to the filehandle, then it will be passed
> to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
> string, and then passes it on to the next lower layer :encoding, which
> will do the encoding, and when it reaches the bottom of the stack the
> data is actually written to the filesystem.
>
> Files opened on Windows already have the :crlf layer pushed by default,
> so you somehow need to get the :encoding layer *below* it. If
> you have it on top, then the crlf substitution happens *after* the
> encoding, leading to incorrect data.

There is still one thing that is not clear to me. The incorrect end-of-line
was

0D 00 0A

But the way you describe it, I would expect it to be

0D 0A 00

That is, first the string is encoded in UTF-16LE and the newline gets
expanded from 0A to 0A 00.

Next, the crlf layer jumps in and blindly adds a carriage return, but
somehow it does manage to get the \r character correct nevertheless, but
loses the high byte of the \n.

Jan Dubois

Jan 21, 2011, 2:48:40 PM
to Erland Sommarskog, perl-u...@perl.org
On Fri, 21 Jan 2011, Erland Sommarskog wrote:
>
> There is still one thing that is not clear to me. The incorrect end-of-line
> was
>
> 0D 00 0A
>
> But the way you describe it, I would expect it to be
>
> 0D 0A 00

I went back to the very first message in the thread, where you write:

| When I open the output in a hex editor I see
|
| 31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00
|

| I would expect to see:
|
| 31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00
|

| That is, I expect \n to be translated to 0D 00 0A 00, now it is translated
| to three bytes.

( from http://code.activestate.com/lists/perl-unicode/3256/ )

So it looks like what you saw is exactly what you expect to see
based on my explanation. :)

I couldn't find any example where you had "\r\0\n" as a line ending.

Cheers,
-Jan


Jan Dubois

Jan 21, 2011, 4:32:27 PM
to Bob Hallissy, perl-u...@perl.org

Depends on what you mean by “correctly”.  It does work correctly as-is, creating output encoded as UTF-16LE with CR/LF line endings.  If you want different layers on different operating systems, then you will need to tell the interpreter what exactly it is you want. Which means you probably have to look at $^O.  Assuming you don’t want the :crlf layer on non-Windows systems it is as easy as:

 

    open(my $fh, ">:raw:encoding(UTF-16LE)", $filename) or die $!;
    binmode($fh, ":crlf") if $^O eq "MSWin32";
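
The same decision can be wrapped in a small helper so that the $^O check lives in one place. This is a sketch under the assumption that :crlf is wanted only on native Windows Perl; the sub name utf16le_write_layers and its optional argument are hypothetical, added so that the logic can be exercised on any platform.

```perl
use strict;
use warnings;

# Return the layer string for writing UTF-16LE text. $os defaults to the
# running platform ($^O); :crlf is appended only for native Windows Perl.
sub utf16le_write_layers {
    my ($os) = @_;
    $os = $^O unless defined $os;
    my $layers = ':raw:encoding(UTF-16LE)';
    $layers .= ':crlf' if $os eq 'MSWin32';
    return $layers;
}

print utf16le_write_layers('MSWin32'), "\n";   # :raw:encoding(UTF-16LE):crlf
print utf16le_write_layers('linux'),   "\n";   # :raw:encoding(UTF-16LE)
```

A call site would then look like open(my $fh, '>' . utf16le_write_layers(), $filename) or die $!;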

 

Cheers,

-Jan

 

PS: I saw some discussion today that the :raw pseudo-layer in the open() call will also remove the buffering layer (it doesn’t do that when you use it in a binmode() call). I’ll try to remember to send a followup once I actually understand what is going on.

Jan Dubois

Jan 21, 2011, 6:04:17 PM
to perl-u...@perl.org
I wrote:
> I saw some discussion today that the :raw pseudo-layer in the open()
> call will also remove the buffering layer (it doesn’t do that when you
> use it in a binmode() call). I’ll try to remember to send a followup
> once I actually understand what is going on.

That seems indeed to be the case right now. The bug is filed here:

http://rt.perl.org/rt3//Public/Bug/Display.html?id=80764

A workaround is to use ":raw:perlio" instead of ":raw" to turn to
binmode without losing the buffering.
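
A sketch of how the workaround might be combined with the UTF-16LE layers discussed earlier in the thread. The combined layer string and the use of File::Temp are assumptions made for illustration; the layer list printed will vary between platforms and Perl versions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# :raw clears the default layers, :perlio puts a buffering layer back,
# then :encoding and :crlf are pushed on top as before.
open(my $fh, '>:raw:perlio:encoding(UTF-16LE):crlf', $filename)
    or die "open $filename: $!";
my @layers = PerlIO::get_layers($fh);
print {$fh} "1\n";
close $fh or die "close $filename: $!";

print "layers: @layers\n";
print "size:   ", (-s $filename), " bytes\n";
```

The layer list should now contain perlio, and the one-line file should be six bytes long (31 00 0D 00 0A 00).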

Cheers,
-Jan


Erland Sommarskog

Jan 21, 2011, 10:56:48 AM
to perl-u...@perl.org
"Jan Dubois" (ja...@activestate.com) writes:
> You need to stack the I/O layers in the right order. The :encoding()
> layer needs to come last (be at the bottom of the stack), *after* the
> :crlf layer adds the additional carriage returns. The way to pop the
> default :crlf layer is to start out with the :raw pseudo-layer:
>
> open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

So this works. But this does not:

use strict;

open F, '>slask.out';
binmode(F, ':raw:encoding(UTF16-LE):crlf');
print F "Alfa\nBeta\nGamma\n";

Looking at the file in a binary editor, I see:

41 00 6C 00 66 00 61 00 0D 0A 00 42 00 65 00 74
00 61 00 0D 0A 00 47 00 61 00 6D 00 6D 00 61 00
0D 0A 00

In total 35 bytes. Which is a very odd number for a UTF16 file.
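
The byte count can be checked with a little arithmetic. The sketch below uses Encode directly instead of I/O layers (an assumption made so that it runs the same way everywhere) and compares the size a correct file should have with the size seen in the dump above.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "Alfa\nBeta\nGamma\n";

# Expected: expand newlines on characters first, then encode.
(my $with_crlf = $text) =~ s/\n/\r\n/g;
my $expected = length(encode('UTF-16LE', $with_crlf));

# Observed in the hex dump above: the text is encoded first and a single
# CR byte is then inserted before each of the three LF bytes.
my $observed = length(encode('UTF-16LE', $text)) + 3;

print "expected $expected bytes, observed $observed bytes\n";
# prints: expected 38 bytes, observed 35 bytes
```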

Jan Dubois

Jan 28, 2011, 7:55:31 PM
to Erland Sommarskog, perl-u...@perl.org
On Fri, 21 Jan 2011, Erland Sommarskog wrote:

I've double-checked with Leon, who thinks that this is due to bug 38456:

http://rt.perl.org/rt3//Public/Bug/Display.html?id=38456

He made a patch to fix the bug, and the patch has been applied to
bleadperl already. I ran your sample script with 5.13.9 plus his
patch, and it generates a correct 38-byte file. I'm not sure
if this change could/should be picked for a 5.12.4 release as
well, but I guess it probably won't. But 5.14 should be out
in April or May anyways...

It looks like there is still a lot of brokenness lurking in
the internals of the Perl I/O layer implementation. :(

Cheers,
-Jan

Erland Sommarskog

Jan 29, 2011, 8:02:33 AM
to perl-u...@perl.org
"Jan Dubois" (ja...@activestate.com) writes:
> I've double-checked with Leon, who thinks that this is due to bug 38456:
>
> http://rt.perl.org/rt3//Public/Bug/Display.html?id=38456
>
> He made a patch to fix the bug, and the patch has been applied to
> bleadperl already. I ran your sample script with 5.13.9 plus his
> patch, and it generates a correct 38-byte file. I'm not sure
> if this change could/should be picked for a 5.12.4 release as
> well, but I guess it probably won't. But 5.14 should be out
> in April or May anyways...
>
> It looks like there is still a lot of brokenness lurking in
> the internals of the Perl I/O layer implementation. :(

Thanks for the update, Jan.

Yes, there certainly seems to be some more stuff to do in the Unicode
support in Perl. For instance, support for Unicode filenames in open or
opendir.

Michael Ludwig

Jan 30, 2011, 1:02:34 PM
to perl-u...@perl.org
Erland Sommarskog wrote on 29.01.2011 at 14:02 (+0100):

> Yes, there certainly seems to be some more stuff to do in the Unicode
> support in Perl. For instance, support for Unicode filenames in open
> or opendir.

I think there is no portable answer here, as it depends on the
filesystem's support for Unicode.

Or what exactly are you referring to?

--
Michael Ludwig

Erland Sommarskog

Jan 31, 2011, 5:42:43 PM
to perl-u...@perl.org
Michael Ludwig (mil...@gmx.de) writes:
> Erland Sommarskog wrote on 29.01.2011 at 14:02 (+0100):
>
>> Yes, there certainly seems to be some more stuff to do in the Unicode
>> support in Perl. For instance, support for Unicode filenames in open
>> or opendir.
>
> I think there is no portable answer here, as it depends on the
> filesystem's support for Unicode.

Did I say it has to be portable? :-)

Obviously, Unicode cannot happen on systems which do not support Unicode.

For instance, I use Windows exclusively, so Unicode in file names is no
problem. On the other hand, it's a dead case for system() and backticks
as far as I can make out. (That is, I have not been able to run Unicode
BAT files.)

Michael Ludwig

Jan 31, 2011, 8:32:37 PM
to perl-u...@perl.org
Erland Sommarskog wrote on 31.01.2011 at 23:42 (+0100):
> Michael Ludwig (mil...@gmx.de) writes:
> > Erland Sommarskog wrote on 29.01.2011 at 14:02 (+0100):
> >
> >> Yes, there certainly seems to be some more stuff to do in the
> >> Unicode support in Perl. For instance, support for Unicode
> >> filenames in open or opendir.
> >
> > I think there is no portable answer here, as it depends on the
> > filesystem's support for Unicode.
>
> Did I say it has to be portable? :-)

No … but Perl did. :-)

> For instance, I use Windows exclusively, so Unicode in file names is
> no problem.

Did a quick test:

\,,,/
(o o)
------oOOo-(_)-oOOo------
use strict;

use warnings;
use utf8;
my $fn = 'a…b.txt'; # with a Unicode character
open my $fh, '>:encoding(UTF-8)', $fn or die "open $fn: $!";
print $fh "$fn\n";
close $fh;
-------------------------

v5.10.1 (*) built for i686-cygwin-thread-multi-64int

* a…b.txt
* correct (in Explorer, cmd.exe, MinTTY)
* has: CYG17 utf8-paths (which might be responsible)

(v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)

* a…b.txt
* not correct
* doesn't have anything with "uni" or "utf" in "perl -V"

--
Michael Ludwig

Erland Sommarskog

Feb 1, 2011, 5:04:20 PM
to perl-u...@perl.org
Michael Ludwig (mil...@gmx.de) writes:
>> For instance, I use Windows exclusively, so Unicode in file names is
>> no problem.
>
> Did a quick test:
>
> (v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)
>
> * a…b.txt
> * not correct
> * doesn't have anything with "uni" or "utf" in "perl -V"

OK, so the implementation would have to know that on this platform
filenames are in UTF-16, on this it is UTF-8 and so on.

Not that it is a terribly big deal. In the program where I want to
support Unicode names, I've already written a module around Win32API::File,
which lets me open a file in Windows and then associate it with
a file handle.
