Message from discussion Why "Wide character in print"?
Newsgroups: comp.lang.perl.misc
From: Ben Morrow <b...@morrow.me.uk>
Date: Thu, 25 Oct 2012 03:15:12 +0100
Local: Wed, Oct 24 2012 10:15 pm
Subject: Re: Why "Wide character in print"?

Quoth Eli the Bearded <*...@eli.users.panix.com>:

> In comp.lang.perl.misc, tcgo  <tomeu...@gmail.com> wrote:
> > And it gives me a "warning" message: "Wide character in print at
> > ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
> > warning disappears, but why was it showing before of adding the
> > binmode?
<snip>

> The explanation in perldiag is a good start:

>   =item Wide character in %s

>   (S utf8) Perl met a wide character (>255) when it wasn't expecting
>   one.  This warning is by default on for I/O (like print).  The easiest
>   way to quiet this warning is simply to add the C<:utf8> layer to the
>   output, e.g. C<binmode STDOUT, ':utf8'>.  Another way to turn off the
>   warning is to add C<no warnings 'utf8';> but that is often closer to
>   cheating.  In general, you are supposed to explicitly mark the
>   filehandle with an encoding, see L<open> and L<perlfunc/binmode>.
<snip>

> But that (and the docs for binmode()) doesn't address why the warning
> will still happen for ":raw" streams:

> echo "some binary stream with U+2639 in it" | \
>    perl -we 'binmode(STDOUT, ":raw");
>              binmode(STDIN,  ":raw");
>                      while(<>) { s/U\+2639/\x{2639}/g; print } '

[You should set $/ = \1024 or something else appropriate before using <>
on a binary file. By default <> reads newline-delimited lines, and there
is no particular reason for newlines to occur in sensible places in a
binary file. Of course, if the file is small enough it may be better to
read the whole thing and skip the while loop altogether.]
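As a rough sketch of the record-size approach: setting $/ to a reference
to an integer makes <> return fixed-size chunks rather than lines. The
helper name read_chunks and the chunk size are only illustrative
assumptions, not anything from the original script.

```perl
use strict;
use warnings;

# read_chunks() is a hypothetical helper, for illustration only.
# It reads a handle in records of at most $size bytes each.
sub read_chunks {
    my ($fh, $size) = @_;
    binmode $fh, ":raw";      # we want bytes, with no layers interfering
    local $/ = \$size;        # reference to an integer: fixed-size records
    my @chunks;
    while (my $chunk = <$fh>) {   # Perl adds the defined() test implicitly
        push @chunks, $chunk;
    }
    return @chunks;
}
```

Something like read_chunks(\*STDIN, 1024) would then return the input in
1024-byte pieces, with a shorter final piece if the length is not a
multiple of 1024.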

If you are dealing with :raw streams then your data needs to be in
bytes. That is, you should be using

    use Encode "encode";

    my $u2639 = encode "UTF-8", "\x{2639}";

    s/U\+2639/$u2639/g;
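Put together, a minimal self-contained version might look like the
following. The sub name patch_bytes is made up for the example; the
point is only that both the data and the replacement are byte strings,
so nothing wide ever reaches print.

```perl
use strict;
use warnings;
use Encode "encode";

# Byte string holding the UTF-8 encoding of U+2639 (bytes E2 98 B9).
my $u2639 = encode "UTF-8", "\x{2639}";

# patch_bytes() is a hypothetical helper: it replaces the ASCII text
# "U+2639" in a byte string with the UTF-8 bytes for that character.
sub patch_bytes {
    my ($bytes) = @_;
    $bytes =~ s/U\+2639/$u2639/g;
    return $bytes;
}
```

Printing the result of patch_bytes($_) to a :raw STDOUT should not
warn, since every character in the string is below 256.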

Imagine you were trying to perform this replacement the other way
around; a substitution like

    s/\x{2639}/U+2639/;

would never match, since the :raw layer would return a UTF-8-encoded
U+2639 as three bytes. (It would also return a UTF-16 U+2639 as two
bytes, and a UCS-4 U+2639 as four bytes.) If you wanted it to match you
would need to use $u2639 defined as above, and deal with the possibility
of the character being split between chunks.
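One way to cope with a split character, sketched under the assumption
that the input is read in fixed-size chunks: hold back the last few
bytes of each chunk (one fewer than the length of the encoded
character) until the next chunk arrives. The name unpatch_stream and
its arguments are invented for this example.

```perl
use strict;
use warnings;
use Encode "encode";

# unpatch_stream() is a hypothetical helper: it copies $in to $out,
# replacing the UTF-8 bytes of U+2639 with the ASCII text "U+2639".
sub unpatch_stream {
    my ($in, $out, $size) = @_;
    my $needle = encode "UTF-8", "\x{2639}";
    my $keep   = length($needle) - 1;  # longest partial match at a chunk end
    local $/ = \$size;                 # read fixed-size records
    my $buf = "";
    while (defined(my $chunk = <$in>)) {
        $buf .= $chunk;
        $buf =~ s/\Q$needle\E/U+2639/g;
        # Emit everything except the last $keep bytes, which might be
        # the start of a character completed by the next chunk.
        if (length($buf) > $keep) {
            print {$out} substr($buf, 0, length($buf) - $keep, "");
        }
    }
    print {$out} $buf;                 # flush whatever was held back
}
```

Holding back bytes is safe here because the replacement text is plain
ASCII and so can never combine with later input to form a false match.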

> I've used the "while(<>) { s///g; print; }" construct to patch binary
> files in the past (rename functions in compiled programs, etc). I haven't
> yet needed to sub-in wide characters, but it doesn't seem unreasonable.

A binary file cannot contain 'wide characters' as such; instead it
contains some *encoding* of wide characters. Since Perl has no way to
guess which encoding you want, you need to be explicit, either by using
Encode directly or by calling it indirectly using PerlIO::encoding.
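To illustrate the indirect route, here is a small sketch that pushes an
:encoding(...) layer onto an in-memory filehandle and lets PerlIO do the
encoding on output. The helper name encode_via_layer is an assumption
made for the example; a real program would more likely put the layer on
STDOUT or a file.

```perl
use strict;
use warnings;

# encode_via_layer() is a hypothetical helper: it writes $text through
# an :encoding layer and returns the bytes that came out the other side.
sub encode_via_layer {
    my ($text, $encoding) = @_;
    my $bytes = "";
    open my $out, ">:encoding($encoding)", \$bytes
        or die "open: $!";
    print {$out} $text;       # characters in, encoded bytes out
    close $out or die "close: $!";
    return $bytes;
}
```

The same character comes out as a different number of bytes depending
on the layer: three for "UTF-8", two for "UTF-16BE" and four for
"UTF-32BE", which is why the encoding has to be stated explicitly.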

> I'm guessing that my binary stream situation is what "no warnings
> 'utf8';" is intended to fix.

No, not at all. If you review the (W utf8) warnings in perldiag, you
will see they all have to do with performing character operations on
Unicode codepoints which are not valid characters (UTF-16 surrogates,
codepoints which haven't been allocated yet, explicit non-characters
like U+FFFF). They have nothing to do with ordinary Unicode IO.

Ben


 