
select a variable as stdout and utf8 flag behaviour


Gert Brinkmann

Nov 9, 2016, 10:00:02 AM
to perl-u...@perl.org
Hello,

I have the following example code:

-----------------------------------------------------
use strict;
use utf8;
use Encode;
use FileHandle;

binmode STDOUT, ":utf8";

my $html = '';

#-- open filehandle to write into the $html variable as utf8
open(my $fh, '>:encoding(UTF-8)', \$html);
my $orig_stdout = select( $fh );


print "Ümläut Test ßaß; 使用下列语言\n";


select( $orig_stdout );
$fh->close();

#You need to activate this line to make utf8 output correct
#Encode::_utf8_on($html);

print $html;
-----------------------------------------------------


This prints out the utf8 characters corrupted. You have to flag the
Variable after writing into it with Encode::_utf8_on() as utf8 to make
it work correctly. (So activate the commented line.)

Using this _utf8_on() usually means that I am doing something wrong. Is
there a better way to achieve the correct behaviour?


Btw. there was a change in the behaviour between perl v5.14.2 and
v5.20.2: In older perl versions you could do a

my $html = '';
Encode::_utf8_on($html);

before opening the file handle onto this variable. In newer perl
versions the utf8 flag is reset on open() and print() to the variable's
file handle.

Greetings
Gert

pa...@cpan.org

Nov 9, 2016, 10:30:02 AM
to Gert Brinkmann, perl-u...@perl.org
On Wednesday 09 November 2016 15:55:47 Gert Brinkmann wrote:
> Hello,
>
...
>
> This prints out the utf8 characters corrupted. You have to flag the
> Variable after writing into it with Encode::_utf8_on() as utf8 to make
> it work correctly. (So activate the commented line.)
>
> Using this _utf8_on() usually means that I am doing something wrong.

Yes, that is true! You should never use the _utf8_on/_utf8_off/is_utf8
functions! They exist *only* for dealing with buggy XS modules, not
for pure Perl code. In pure Perl code you must *not* care about the UTF8
flag.

> Is there a better way to achieve the correct behaviour?

Of course! When you think that you need Encode::_utf8_on(), use
utf8::decode() instead (or Encode::decode('UTF-8', ...)). Similarly, use
utf8::encode() (or Encode::encode('UTF-8', ...)) instead of
Encode::_utf8_off().

> Btw. there was a change in the behaviour between perl v5.14.2 and
> v5.20.2: In older perl versions you could do a
>
> my $html = '';
> Encode::_utf8_on($html);
>
> before opening the file handle onto this variable. In newer perl
> versions the utf8 flag is reset on open() and print() to the variable's
> file handle.

The UTF8 flag just indicates whether the internal encoding of a Perl
scalar is Latin1 or UTF-8. But that is internal: any Latin1 string can
be represented either in Latin1 (without the UTF8 flag) or in UTF-8
(with the UTF8 flag). You should not care about the internal
representation in pure Perl code. Any Perl function can at any time
convert a scalar between these two encodings if that is possible (for
the ASCII and Latin1 charsets).
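
To illustrate that the flag describes storage and not content, here is a
minimal sketch. It uses utf8::upgrade() to force the UTF-8 internal form
and peeks at the flag with utf8::is_utf8() purely for demonstration; the
variable names are made up for this example.

```perl
use strict;
use warnings;

# "\xE4" is a-umlaut, a character that fits in Latin-1.
my $plain    = "Uml\xE4ut";
my $upgraded = $plain;

# utf8::upgrade() switches the internal storage to UTF-8 (which sets the
# flag) without changing which characters the string contains.
utf8::upgrade($upgraded);

my $same_text    = ($plain eq $upgraded)                 ? 1 : 0;
my $same_length  = (length($plain) == length($upgraded)) ? 1 : 0;
my $flags_differ = (!utf8::is_utf8($plain) && utf8::is_utf8($upgraded)) ? 1 : 0;

print "same text: $same_text, same length: $same_length, ",
      "flags differ: $flags_differ\n";
```

Both scalars compare equal and have the same length even though only
one of them carries the flag, which is why code should not branch on it.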

(Btw, on EBCDIC platforms the UTF8 flag indicates whether the internal
encoding is UTF-EBCDIC or EBCDIC, not UTF-8!! So really do not depend on
the UTF8 flag!)

And to your question, here is an explanation of your source code:

> -----------------------------------------------------
> use strict;
> use utf8;

Now the source code is expected to be in UTF-8, and string literals are
decoded into wide (Unicode) characters.

> use Encode;
> use FileHandle;
>
> binmode STDOUT, ":utf8";

Now printing to the STDOUT handle accepts wide characters (> 0xFF) and
converts the output to utf8 octets. So your terminal should be
configured to accept and show UTF-8 sequences correctly.

>
> my $html = '';
>
> #-- open filehandle to write into the $html variable as utf8
> open(my $fh, '>:encoding(UTF-8)', \$html);

Now printing to $fh accepts wide characters and converts the printed
characters to utf8 octets before storing them in $html. It means that
$html will *always* contain a sequence of numbers which represent utf8
octets.

> my $orig_stdout = select( $fh );
>
>
> print "Ümläut Test ßaß; 使用下列语言\n";

Now you have a string with wide characters, and this print sends that
string to $html. In $html you have a sequence of octets which contains
the encoded form of that wide string.

>
>
> select( $orig_stdout );
> $fh->close();
>
> #You need to activate this line to make utf8 output correct
> #Encode::_utf8_on($html);
>
> print $html;

And now you send a sequence of utf8 octets to STDOUT, which expects wide
characters that it then converts to utf8 octets. So what you get is a
double-encoded utf8 sequence.

Now stop, and think about why this is true!

> -----------------------------------------------------

The fix is really simple. Either decode the utf8 octets in $html back to
wide characters (via utf8::decode($html)), or tell STDOUT that it should
not expect wide strings but raw octets (= remove the
binmode STDOUT, ":utf8"; line).

Again... think about why both of my proposed fixes work.
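
As a rough sketch of the bug and the two fixes side by side: the code
below replaces the real STDOUT with in-memory handles so the resulting
octets can be compared. The helper octets_written() is hypothetical, the
file is assumed to be saved as UTF-8, and :encoding(UTF-8) stands in for
the :utf8 layer of the original script.

```perl
use strict;
use warnings;
use utf8;                       # assumes this file is saved as UTF-8
use Encode ();

my $text = "Ümläut Test ßaß; 使用下列语言\n";

# Same capture as in the original script: $html ends up holding octets.
my $html = '';
open(my $fh, '>:encoding(UTF-8)', \$html) or die "open: $!";
print {$fh} $text;
close($fh) or die "close: $!";

# Hypothetical helper: which octets arrive on a handle opened with $layer?
sub octets_written {
    my ($layer, $data) = @_;
    my $out = '';
    open(my $term, ">$layer", \$out) or die "open: $!";
    print {$term} $data;
    close($term) or die "close: $!";
    return $out;
}

# What a UTF-8 terminal needs to receive to show $text correctly.
my $expected = Encode::encode('UTF-8', $text);

# The bug: octets sent through an encoding layer get encoded again.
my $broken = octets_written(':encoding(UTF-8)', $html);

# Fix 1: decode the octets back to characters, keep the encoding layer.
my $fix1 = octets_written(':encoding(UTF-8)', Encode::decode('UTF-8', $html));

# Fix 2: keep the octets and write them through a handle without a layer.
my $fix2 = octets_written(':raw', $html);

printf "broken matches: %s\n", ($broken eq $expected ? 'yes' : 'no');
printf "fix 1 matches:  %s\n", ($fix1   eq $expected ? 'yes' : 'no');
printf "fix 2 matches:  %s\n", ($fix2   eq $expected ? 'yes' : 'no');
```

Both fixes work for the same reason: the data gets encoded exactly once
on its way from the wide string to the terminal.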



Btw, Perl does not use the UTF-8 encoding internally, but Perl's
extended utf8. If you want strict UTF-8, use the ":encoding(UTF-8)"
layer. The layers ":utf8" and ":encoding(utf8)" (without hyphen) are the
non-strict Perl extended utf8 encodings. utf8::encode/decode are
non-strict as well...
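
One way to see the strict/non-strict difference is to ask Encode which
encoding object each spelling resolves to, and to try a code point just
beyond the Unicode range. This is only a sketch: can_encode() is a
made-up helper, and the exact behaviour for edge cases may vary between
Encode versions.

```perl
use strict;
use warnings;
use Encode ();

# The two spellings resolve to different encoding objects.
my $strict_name = Encode::find_encoding('UTF-8')->name;
my $lax_name    = Encode::find_encoding('utf8')->name;
print "UTF-8 resolves to $strict_name, utf8 resolves to $lax_name\n";

# A code point just above the Unicode maximum of U+10FFFF.
my $beyond = chr(0x110000);

# Hypothetical helper: true if $string can be encoded with $encoding
# when Encode is told to croak on anything it refuses.
sub can_encode {
    my ($encoding, $string) = @_;
    my $ok = eval {
        Encode::encode($encoding, $string,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $lax_ok    = can_encode('utf8',  $beyond);
my $strict_ok = can_encode('UTF-8', $beyond);
print "utf8 accepts it: $lax_ok, UTF-8 accepts it: $strict_ok\n";
```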

Gert Brinkmann

Nov 9, 2016, 2:00:02 PM
to perl-u...@perl.org

Pali, thank you very much for your answer. I am using the
Encode::decode('UTF-8', ...) function now instead of touching the flag.
Though I am not sure whether a routine becomes better (more robust) if
it accepts utf8 instead of the stricter utf-8, or whether it is better
if it accepts only strict utf-8.

On 09.11.2016 16:20, pa...@cpan.org wrote:

> Fix is really simple. Either decode utf8 octets in $html back to wide
> characters (via utf8::decode($html)) or tell STDOUT that it does not
> expect wide strings, but raw octets (= remove binmode STDOUT, ":utf8";)
> line.
>
> Again... think about it, why both my proposed fixes are working.

I am close to understanding it. But I wonder why I have to think about
utf-8 in this case? I expected perl to do it right automagically:

I open the filehandle to write into the variable using :encoding(UTF-8).
So perl should know what it is storing inside the variable. If I print
this to STDOUT (binmoded to utf-8) it should automatically print the
content of the variable the right way.

So why does it not know about the content being utf-8? If I am using
"use utf8" and define a variable containing utf-8 data in the source
code, perl knows how to handle this correctly, too, without the need
to decode anything manually.

Probably perl does not know about the content of the variable. Only the
filehandle is set to write utf-8 data. The content of the variable is
only bytes, similar to a file that I am writing bytes into. If I read
the file again, I have to open it as utf-8. Alternatively I guess I can
open it as raw bytes and decode the data afterwards to utf-8? The latter
way would be the same as the decoding of the variable content?

Ciao
Gert



pa...@cpan.org

Nov 9, 2016, 3:45:02 PM
to perl-u...@perl.org, Gert Brinkmann
On Wednesday 09 November 2016 19:46:46 Gert Brinkmann wrote:
> Pali, thank you very much for your answer. I am using the
> Encode::decode('UTF-8', ...) function now instead of touching the
> flag. Though I am not sure if a routine becomes better (more robust)
> if it accepts utf8 instead the stricter utf-8. Or if it is better if
> it only accepts strict utf-8?

'UTF-8' (with hyphen) is strict UTF-8. UTF8, utf8 (without hyphen) is
Perl's non-strict extended utf8.

What to use depends on your needs... I would really suggest using
strict UTF-8 when doing data exchange and sending data to or receiving
data from the outside world.

> On 09.11.2016 16:20, pa...@cpan.org wrote:
> > Fix is really simple. Either decode utf8 octets in $html back to
> > wide characters (via utf8::decode($html)) or tell STDOUT that it
> > does not expect wide strings, but raw octets (= remove binmode
> > STDOUT, ":utf8";) line.
> >
> > Again... think about it, why both my proposed fixes are working.
>
> I am near to understand it. But I wonder why I have to think about
> utf-8 in this case? I expected that perl is doing it right
> automagically:
>
> I open the filehandle to write into the variable using
> :encoding(UTF-8). So perl should know what it is storing inside the
> variable. If I print this to STDOUT (binmoded to utf-8) it should
> automatically print the content of the variable the right way.

A string is just a sequence of characters. And a character is just a
number. In the C language (char), on disk, or in other storage, a
character is 8 bits. In Perl it can be up to 64 bits (if you have a
64-bit perl). And in Perl that number represents a Unicode code point.
So 0x100 is LATIN CAPITAL LETTER A WITH MACRON, 0xFE is LATIN SMALL
LETTER THORN, ...

UTF-8 is a transformation which converts between a sequence of Unicode
code points and a sequence of 8-bit numbers. Encode::encode('UTF-8',
$str) just takes the sequence of (wide Unicode) numbers from $str,
converts it to a UTF-8 sequence and returns that sequence of 8-bit
numbers. A string is just a sequence of numbers, so Perl treats the
returned scalar as a string too (which now has a different meaning).
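
A small sketch of that "sequence of numbers" view, using the two
characters mentioned above (the variable names are only for this
example):

```perl
use strict;
use warnings;
use Encode ();

# Two characters: U+0100 and U+00FE.
my $chars  = "\x{100}\x{FE}";
my $octets = Encode::encode('UTF-8', $chars);

# Show every element of each string as a number.
my @char_numbers  = map { sprintf '0x%X', ord($_) } split //, $chars;
my @octet_numbers = map { sprintf '0x%X', ord($_) } split //, $octets;

print "characters: @char_numbers\n";   # characters: 0x100 0xFE
print "octets:     @octet_numbers\n";  # octets:     0xC4 0x80 0xC3 0xBE
```

Both scalars are ordinary Perl strings; only the meaning of their
numbers differs (code points in one, UTF-8 octets in the other).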

The :encoding(UTF-8) and :utf8 layers just do automatic
encoding/decoding of written/read data. It is the same as if you called
encode/decode manually before print or after read.
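
A quick way to convince yourself of that equivalence on the writing
side is to capture the output of both variants in memory and compare
them. This is a sketch that assumes the file is saved as UTF-8; the
variable names are invented for the example.

```perl
use strict;
use warnings;
use utf8;                       # assumes this file is saved as UTF-8
use Encode ();

my $text = "Ümläut ßaß 使用\n";

# Variant 1: let the layer do the encoding.
my $via_layer = '';
open(my $layered, '>:encoding(UTF-8)', \$via_layer) or die "open: $!";
print {$layered} $text;
close($layered) or die "close: $!";

# Variant 2: encode by hand and write the octets through a raw handle.
my $via_encode = '';
open(my $raw, '>:raw', \$via_encode) or die "open: $!";
print {$raw} Encode::encode('UTF-8', $text);
close($raw) or die "close: $!";

print "identical octets: ", ($via_layer eq $via_encode ? 'yes' : 'no'), "\n";
```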

If you look at your code again, it can be rewritten as:

use strict;
use utf8;
use Encode;
use FileHandle;
my $html = '';
open(my $fh, '>', \$html);
my $orig_stdout = select( $fh );
print Encode::encode('UTF-8', "Ümläut Test ßaß; 使用下列语言\n");
select( $orig_stdout );
$fh->close();
print Encode::encode('UTF-8', $html);

I just used explicit encode calls instead of the implicit ones (which
are hidden in the :utf8 resp. :encoding(UTF-8) layers).

Look at it again: you encoded the string two times! Encoding is done
when you write to a FH, and decoding when you read from a FH.

The UTF-8 encoder takes a sequence of numbers (range 0x00..0x10FFFF
minus some disallowed ones) and returns another sequence of numbers
(range 0x00..0xFF). And if you call it two times, then you get
something which is encoded two times = garbage.
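
For a single character the effect can be shown as a hex dump. A minimal
sketch, using U+00DC (the "Ü" from the test string):

```perl
use strict;
use warnings;
use Encode ();

my $char  = "\x{DC}";                           # one character, U+00DC
my $once  = Encode::encode('UTF-8', $char);     # correct UTF-8
my $twice = Encode::encode('UTF-8', $once);     # encoded a second time

my $hex_once  = unpack('H*', $once);
my $hex_twice = unpack('H*', $twice);

print "once:  $hex_once\n";    # once:  c39c
print "twice: $hex_twice\n";   # twice: c383c29c

# One decode only undoes one of the two encodings.
my $undone = Encode::decode('UTF-8', $twice);
print "one decode gives back the single-encoded octets: ",
      ($undone eq $once ? 'yes' : 'no'), "\n";
```

A terminal that decodes "c383c29c" as UTF-8 sees the two characters
U+00C3 and U+009C instead of U+00DC, which is the typical look of
double-encoded output.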

> So why does it not know about the content being utf-8?

Because Perl string scalars are always treated as a sequence of numbers,
and each number represents one (Unicode) character. A Perl scalar does
not record anything about whether it is "raw" (e.g. a sequence of UTF-8
octets) or normal.

> If I am using
> "use utf8" and define an utf-8 data containing variable in the source
> code, perl knows to handle this the correct way, too, without the
> need to decode anything manually.

use utf8 tells perl that the string constants in the source are wide
Unicode strings.

Take an example:

use utf8;
my $str = "使用下列语";

is equivalent to:

my $str = "\x{4F7F}\x{7528}\x{4E0B}\x{5217}\x{8BED}";


Example without utf8:

my $str = "使用";

is equivalent to:

my $str = "\x{E4}\x{BD}\x{BF}\x{E7}\x{94}\x{A8}";

In this case the input source file was parsed as an 8-bit file, and the
string contains different characters.
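
The two cases can be put into one file by scoping the pragma to a block.
This is a sketch that assumes the file itself is saved as UTF-8;
$with and $without are names chosen for the example.

```perl
use strict;
use warnings;

my ($with, $without);

{
    use utf8;               # literals in this block are decoded
    $with = "使用";
}

{
    no utf8;                # literals in this block stay as raw octets
    $without = "使用";
}

printf "with use utf8:    %d characters\n", length($with);      # 2
printf "without use utf8: %d characters\n", length($without);   # 6

# The same strings written with escapes only.
my $with_matches    = ($with    eq "\x{4F7F}\x{7528}")          ? 1 : 0;
my $without_matches = ($without eq "\xE4\xBD\xBF\xE7\x94\xA8")  ? 1 : 0;
print "escapes match: $with_matches $without_matches\n";
```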

> Probably perl does not know about the content of the variable.

You can say that. It really does not know whether a variable contains a
sequence of ISO-8859-1 numbers, a sequence of UTF-8 numbers or Unicode
code points... It always thinks of and treats a variable as a sequence
of Unicode code points. And if you store something else in it, that is
your responsibility.

> Only
> the filehandle is set to write utf-8 data. The content of the
> variable is only bytes, similar to a file that I am writing bytes
> into.

But with binmode STDOUT, ":utf8"; you said that the data you are going
to write are *not* raw and that perl must first encode them to UTF-8. So
$html is expected to be not raw, but you had already encoded it.

> If I read the file again, I have to open it as utf-8.
> Alternatively I guess I can open it as raw bytes and decode the data
> afterwards to utf-8? The latter way would be the same as the
> decoding of the variable content?

If you still do not see how it works, then forget about the existence of
perlio layers and write explicit encode/decode calls. Once you fully
understand how and where to call encode/decode, you can replace those
explicit calls with implicit ones via perlio layers.
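
For the reading direction asked about above, a sketch of both styles
could look like this. An in-memory buffer ($file_octets, a made-up name)
stands in for a file on disk that holds UTF-8 octets.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for a file on disk: the UTF-8 octets of two CJK characters
# followed by a newline.
my $file_octets = Encode::encode('UTF-8', "\x{4F7F}\x{7528}\n");

# Explicit style: read raw octets, then decode them yourself.
open(my $raw, '<:raw', \$file_octets) or die "open: $!";
my $line_octets = <$raw>;
close($raw) or die "close: $!";
my $explicit = Encode::decode('UTF-8', $line_octets);

# Implicit style: let the layer decode while reading.
open(my $in, '<:encoding(UTF-8)', \$file_octets) or die "open: $!";
my $implicit = <$in>;
close($in) or die "close: $!";

print "same characters: ", ($explicit eq $implicit ? 'yes' : 'no'), "\n";
printf "octets read raw: %d, characters after decoding: %d\n",
       length($line_octets), length($explicit);
```

Either way the decode happens exactly once between the octets in
storage and the characters in the program.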


... I hope this helps you ...

Aristotle Pagaltzis

Nov 10, 2016, 4:45:02 AM
to perl-u...@perl.org
* Gert Brinkmann <g1...@netcologne.de> [2016-11-09 16:00]:
> open(my $fh, '>:encoding(UTF-8)', \$html);
> my $orig_stdout = select( $fh );
> print "Ümläut Test ßaß; 使用下列语言\n";

Think of it this way:

Those three lines of code are an elaborate way of doing this:

$html = Encode::encode('UTF-8', "Ümläut Test ßaß; 使用下列语言\n");

If you wrote that code, would you be surprised that $html does not
have the UTF8 flag set afterwards?

Bonus question if you are not surprised then: what is the difference
between these two cases that makes your argument that “perl knows what
I put in there so it should know to set the UTF8 flag on it” not apply
to this?

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>