
Clear the "Wide character in print" warning and leave the output unmangled


jid...@jidanni.org

Nov 2, 2012, 1:23:24 AM
None of the advice on perlunifaq or elsewhere can both
* Clear the "Wide character in print" warning, and
* Leave the output non doubly encoded.

#!/usr/bin/perl

# How to test this program:
# $ export restriction=TW; PERLLIB=$HOME/perl5/lib/perl5 ./ytpl jidanni2 > /tmp/o
# $ cat /tmp/o
# That will show you any problems it has.

# Print out YouTube playlists. Usage:
# Example: $0 YouTubeUserID
# Example: restriction=TW $0 jidanni2
# Copyright : http://www.fsf.org/copyleft/gpl.html
# Author : Dan Jacobson -- http://jidanni.org/
# Created On : Wed Mar 2 08:35:33 2011
# Last Modified On: Fri Nov 2 13:23:16 2012
# Update Count : 830
use strict;

#use Encode;
#use warnings FATAL => 'all';
#binmode STDIN, ":utf8";
#binmode STDOUT, ':encoding(UTF-8)';
#binmode STDIN, ':encoding(UTF-8)';
#binmode(STDOUT);
#binmode STDERR, ":utf8";binmode STDOUT, ":utf8";binmode STDIN, ":utf8";
#use utf8;
#use open qw/:std :encoding(utf8)/;
##use diagnostics;
#use Data::Dumper;
use WebService::GData::Constants qw(:all);
use WebService::GData::YouTube;
die 'Specify a user please.' unless my $user = shift;
my ( %checklist, %vids, $playlists, );
my $yt = new WebService::GData::YouTube();
$yt->connection->env_proxy;
##$yt->connection->enable_compression(TRUE); #disaster
$yt->query->max_results(50);

#if the number of your playlist is superior to 50, you will need to
#loop via the result like you used to do before with the video results
#(start_index+items_per_page). there is no other easy way to do this
#yet.

eval { $playlists = $yt->get_user_playlists($user) } or die $@->content;
@$playlists = sort { $a->title cmp $b->title } @$playlists;
for ( $ENV{restriction} ) { $yt->query()->restriction($_) if $_ }

for my $playlist (@$playlists) {
    my @missing = (undef) x $playlist->count_hint;
    my $entries;
    while (
        eval {
            ## can't use compression starting here:
            $entries = $yt->get_user_playlist_by_id( $playlist->playlist_id );
        }
      )
    {
        die $@->content if $@;
        for my $entry (@$entries) {
            my $IDP = ( split( /:/, $entry->id ) )[-1] or die;
            if (   $entry->appcontrol_state
                && $entry->appcontrol_state eq "requesterRegion" )
            {
                # print "yy$IDP ", $entry->id, "\n";
                next;
            }

            ## http://code.google.com/intl/en/apis/youtube/2.0/reference.html#youtube_data_api_tag_yt:state
            ## Also one day could use
            ## my $string = $entry->denied_countries;
            ## my @matches = $string=~m/(TW|US)/g;

            delete $missing[ $entry->position - 1 ];
            my $v = sprintf "%03d|%s|%s|%s", $entry->position, $entry->video_id,
              $IDP,
              $entry->title;

            if ( $entry->media_player ) {

                # use Data::Dumper;
                push @{ $vids{1}{ $playlist->playlist_id } }, $v;

                # print STDERR Dumper("紅",$playlist->title), "紅", $playlist->title;
                # die;
                unless ( $playlist->title eq '英文歌詞 English lyrics' ) {
                    push @{ $checklist{ $entry->video_id } }, join "|",
                      $playlist->title,
                      $v;
                }
            }
            else {
                push @{ $vids{0}{ $playlist->playlist_id } },
                  "# $v|" . $entry->appcontrol_state;

                # print STDERR "xx$IDP\n";

            }
        }
    }
    for ( 0 .. $playlist->count_hint - 1 ) {
        if ( exists $missing[$_] ) {
            push @{ $vids{0}{ $playlist->playlist_id } },
              sprintf "# %03d|Problem!", $_ + 1;
            ## try watching it in a browser when logged out to find out what was wrong
        }
    }
}
{
    my ( $total, $list ) = ( 0, 'Duplicates' );
    printf "\n%d playlists, %d videos.\n:::: $list:\n", scalar @$playlists,
      scalar keys %checklist;
    for ( keys %checklist ) {
        if ( $#{ $checklist{$_} } ) {
            for ( @{ $checklist{$_} } ) { print "$_\n"; $total++ }
        }
    }
    print "Total $list: $total\n";
}

for my $playlist (@$playlists) {
    push @{ $vids{0}{ $playlist->playlist_id } }, "Empty playlist!"
      unless $vids{1}{ $playlist->playlist_id };
}

{
    my @list = qw/Unavailable Available/;
    for ( 0, 1 ) {
        print "\n:::: $list[$_]:\n";
        my $total = 0;
        for my $playlist (@$playlists) {
            next unless $vids{$_}{ $playlist->playlist_id };
            ## print '==== http://www.youtube.com/my_playlists?p=',
            ## print '==== http://www.youtube.com/playlist?action_edit=1&list=PL',
            ## print '==== http://www.youtube.com/playlist?list=PL',
            print '==== http://www.youtube.com/playlist?list=',
              $playlist->playlist_id, ' |', $playlist->title, "\n";
            for ( sort @{ $vids{$_}{ $playlist->playlist_id } } ) {

                # print decode_utf8( $_ ), "\n";
                print $_, "\n";
                $total++;
            }
        }
        print "Total $list[$_]: $total\n";
    }
}

Peter J. Holzer

Nov 3, 2012, 8:08:22 AM
On 2012-11-02 05:23, jid...@jidanni.org <jid...@jidanni.org> wrote:
> None of the advice on perlunifaq or elsewhere can both
> * Clear the "Wide character in print" warning, and
> * Leave the output non doubly encoded.
>
> #!/usr/bin/perl
>
> # How to test this program:
> # $ export restriction=TW; PERLLIB=$HOME/perl5/lib/perl5 ./ytpl jidanni2 > /tmp/o
> # $ cat /tmp/o
> # That will show you any problems it has.

Thanks for providing a complete script which demonstrates the problem.
This makes finding the problem simpler. However:

[...]
> use WebService::GData::Constants qw(:all);
> use WebService::GData::YouTube;
> die 'Specify a user please.' unless my $user = shift;

I'm not going to create a youtube account just to test this script.
So I cannot test it.

Unfortunately, you didn't report where the "Wide character in print"
warning occurs, either, and it is not obvious to me from the source
code. I am guessing that it happens in the last loop, because you tried
to use decode_utf8 there.

So I'm just giving generic advice here:

1) Always use “binmode(..., ":encoding(...)");” explicitly on STDIN,
STDOUT and STDERR. The encoding must be the one your terminal uses,
so if your terminal supports UTF-8, use that. (for production, you
might want to use “use open ":locale"”, but for debugging it's best
to eliminate any source of variable behaviour and hardcode the
encoding).

2) Try to shorten your program further, to make it easier to see where
the problem is without actually running the program.

3) When processing character data, convert from (external) byte
encodings to (internal) character strings as early as possible.

My guess is that you get some byte encoded data from the
WebService::GData module. You should decode() this, and you should do
this as early as possible so that the rest of your code doesn't have
to care about the encoding. This is especially necessary if you
combine strings from several sources which might use different
encodings.

4) When searching for encoding problems, I like to use this simple
function to dump strings to stdout:

    sub dumpstr {
        my ($s) = @_;

        print utf8::is_utf8($s) ? "char" : "byte";
        print ":";
        for (split //, $s) {
            printf " %#02x", ord($_);
        }
        print "\n";
    }

use it to dump the string that is giving you the warning or that is
double-encoded. That will usually tell you *what* is wrong with the
string, but not *why* it is wrong. Then go backwards through the code
to see where you get the string from. If the string is computed from
some other string(s) (e.g. concatenation, substring, etc), dump the
inputs in the same way. Eventually you will have identified the
source of the "wrong" string, and then you can probably fix it with
a simple call to decode() right at the source. (If you get the string
from a module, you might also want to file a bug report).
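
A minimal sketch of points 1, 3 and 4 taken together, assuming the input arrives as UTF-8 octets. The helper name dumpstr_text is hypothetical; it follows the dumpstr idea above but returns the text instead of printing it. The sample octets stand in for whatever the module returns.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical variant of dumpstr: build the dump as a string.
sub dumpstr_text {
    my ($s) = @_;
    my $kind = utf8::is_utf8($s) ? "char" : "byte";
    return "$kind:" . join "", map { sprintf " %#04x", ord($_) } split //, $s;
}

# Point 1: hardcode the output encoding while debugging.
binmode STDOUT, ":encoding(UTF-8)";

# Point 3: decode external octets as early as possible.
my $bytes = "\xe5\x8f\xb0";              # UTF-8 octets for U+53F0
my $chars = decode( "UTF-8", $bytes );

# Point 4: dump both forms to see what each string really holds.
print dumpstr_text($bytes), "\n";        # byte: 0xe5 0x8f 0xb0
print dumpstr_text($chars), "\n";        # char: 0x53f0
print "$chars\n";                        # no "Wide character" warning
```

The dump shows three units before decoding and one afterwards, which is the difference to look for when a string is suspected of being undecoded or double-encoded.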

hp

--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | h...@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel

Peter J. Holzer

Nov 3, 2012, 2:33:03 PM
On 2012-11-03 12:08, Peter J. Holzer <hjp-u...@hjp.at> wrote:
> On 2012-11-02 05:23, jid...@jidanni.org <jid...@jidanni.org> wrote:
>> None of the advice on perlunifaq or elsewhere can both
>> * Clear the "Wide character in print" warning, and
>> * Leave the output non doubly encoded.
>>
>> #!/usr/bin/perl
>>
>> # How to test this program:
>> # $ export restriction=TW; PERLLIB=$HOME/perl5/lib/perl5 ./ytpl jidanni2 > /tmp/o
>> # $ cat /tmp/o
>> # That will show you any problems it has.
>
> Thanks for providing a complete script which demonstrates the problem.
> This makes finding the problem simpler. However:
>
> [...]
>> use WebService::GData::Constants qw(:all);
>> use WebService::GData::YouTube;
>> die 'Specify a user please.' unless my $user = shift;
>
> I'm not going to create a youtube account just to test this script.
> So I cannot test it.

I spoke too soon. It turns out that the script can retrieve other
people's playlists, so I can run it with "jidanni2" as a parameter and
don't need an account and playlist of my own.


> Unfortunately, you didn't report where the "Wide character in print"
> warning occurs, either, and it is not obvious to me from the source
> code. I am guessing that it happens in the last loop, because you tried
> to use decode_utf8 there.

Yes, the guess was correct.


> My guess is that you get some byte encoded data from the
> WebService::GData module. You should decode() this, and you should do
> this as early as possible so that the rest of your code doesn't have
> to care about the encoding. This is especially necessary if you
> combine strings from several sources which might use different
> encodings.

This guess was also correct. $entry->title returns what looks like a
UTF-8 encoded string. So what I guess should be "台灣軍機 Taiwan
Aircrafts found in Google Earth" (char: 0x53f0 0x7063 0x8ecd 0x6a5f 0x20
0x20 0x20 0x54 0x61 ...) is returned as (char: 0xe5 0x8f 0xb0 0xe7 0x81
0xa3 0xe8 0xbb 0x8d 0xe6 0xa9 0x9f 0x20 0x20 0x20 0x54 0x61 ...).
To make it even more confusing, the string is marked as a character
string (the UTF8 bit is on) instead of a byte string. This is definitely
a bug in WebService::GData.
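
To illustrate the state described above, here is a small sketch that builds such a string by hand (the first two characters of the title only) and repairs it. It does not call WebService::GData; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Each UTF-8 octet stored as its own character, as in the dump above.
my $mangled = "\xe5\x8f\xb0\xe7\x81\xa3";
utf8::upgrade($mangled);    # mimic the UTF8 flag being on; the characters stay the same

# Work on a copy: turn it back into plain octets, then decode those.
my $octets = $mangled;
utf8::downgrade($octets);   # would die if a character above 0xFF were present
my $title = decode( "UTF-8", $octets );

printf "%d characters: %s\n", length($title),
  join " ", map { sprintf "U+%04X", ord($_) } split //, $title;
# prints "2 characters: U+53F0 U+7063"
```

Six units go in and two characters come out, matching the 0x53f0 0x7063 sequence that was expected for the title.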

And even worse, it isn't even reliably wrong: "Rob 'N' Raz Featuring
Leila K - Got To Get" has a U+200E character just before the -. This is
probably where the "wide character" warning came from. After I put in an
appropriate “decode("UTF-8", $entry->title)”, the script now dies on that
title. So I would have to wrap the call in an eval {} block or possibly
use some heuristics to check whether decoding is necessary or not. This
is where I stop and let you take over.

So, to summarize:

1) Put in “binmode STDOUT, ":encoding(UTF8)";”
2) Put in “use utf8;”
3) decode() the return value of $entry->title (and possibly some other
calls) but be aware that this doesn't always work, so you need a
fall-back strategy.
4) Report the bug.
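
One possible fall-back strategy for point 3, as a sketch only. The helper name maybe_decode is hypothetical, and the rule it applies is a heuristic: a string that already contains a character above 0xFF is taken to be decoded, and a string that is not valid UTF-8 is returned unchanged. A Latin-1 string that happens to form valid UTF-8 would still be decoded wrongly.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return a character string whether or not
# the input still needed decoding.
sub maybe_decode {
    my ($s) = @_;

    # A character above 0xFF cannot be an octet, so assume it is decoded already.
    return $s if $s =~ /[^\x00-\xff]/;

    my $octets = $s;
    utf8::downgrade($octets);

    # FB_CROAK makes decode() die on malformed input; eval catches that.
    my $decoded = eval { decode( "UTF-8", $octets, FB_CROAK | LEAVE_SRC ) };
    return defined $decoded ? $decoded : $s;
}

my $from_octets = maybe_decode("\xe5\x8f\xb0");                      # decoded to U+53F0
my $already_ok  = maybe_decode("Leila K \x{200E}- Got To Get");     # left alone
my $not_utf8    = maybe_decode("caf\xe9");                           # left alone
```

Applied to every title right where it is fetched, this keeps the rest of the script working with character strings only.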

jid...@jidanni.org

Nov 4, 2012, 12:28:17 AM
No, you don't need a YouTube account.
Yes, I tried everything you said, but it doesn't work.
Not even the example on http://ahinea.com/en/tech/perl-unicode-struggle.html

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";
binmode DATA, ":utf8";
my $ustring2 = <DATA>;

print "$ustring1$ustring2";
__DATA__
Hello ☺!

This should print two equal lines and make no annoying warning.

But nowadays it DOES make the annoying warning.
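
One likely cause is that the example sets a layer on DATA but none on STDOUT. The sketch below shows the difference using in-memory filehandles in place of STDOUT, so that the warning can be captured and counted. The handle and variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $line = "Hello \x{263A}!\n";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No encoding layer on the handle: print warns about the wide character.
open my $raw, '>', \my $raw_out or die $!;
print {$raw} $line;
close $raw;

# With an encoding layer the same print is silent.
open my $enc, '>:encoding(UTF-8)', \my $enc_out or die $!;
print {$enc} $line;
close $enc;

print scalar(@warnings), " warning(s) collected\n";
```

In the script itself the equivalent change would be a binmode STDOUT, ":encoding(UTF-8)"; before the first print.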

In my program no matter if I do

# $entry->title;
utf8::is_utf8( $entry->title )
? $entry->title
: encode( "utf8", $entry->title );

with encode or decode etc. etc. it doesn't work.
It all just doesn't work.

jid...@jidanni.org

Nov 4, 2012, 12:29:17 AM
Here's my stupid program again.

#!/usr/bin/perl

# How to test this program:
# $ export restriction=TW; PERLLIB=$HOME/perl5/lib/perl5 ./ytpl jidanni2 > /tmp/o
# $ cat /tmp/o
# That will show you any problems it has.
# No you don't need a youtube account.

# Print out YouTube playlists. Usage:
# Example: $0 YouTubeUserID
# Example: restriction=TW $0 jidanni2
# Copyright : http://www.fsf.org/copyleft/gpl.html
# Author : Dan Jacobson -- http://jidanni.org/
# Created On : Wed Mar 2 08:35:33 2011
# Last Modified On: Sun Nov 4 12:38:33 2012
# Update Count : 851
use strict;
use Encode;

#use warnings FATAL => 'all';
binmode STDERR, ":utf8";
#binmode STDOUT, ":utf8";
binmode STDIN, ":utf8";

#use utf8;
#use open qw/:std :encoding(utf8)/;
##use diagnostics;
#use Data::Dumper;
use WebService::GData::Constants qw(:all);
use WebService::GData::YouTube;
die 'Specify a user please.' unless my $user = shift;
# $entry->title;
utf8::is_utf8( $entry->title )
? $entry->title
: encode( "utf8", $entry->title );

# dumpstr($entry->title);

if ( $entry->media_player ) {

# use Data::Dumper;
push @{ $vids{1}{ $playlist->playlist_id } }, $v;

# print STDERR Dumper("紅",$playlist->title), "紅", $playlist->title;
# die;
unless ( $playlist->title eq '英文歌詞 English lyrics' ) {
push @{ $checklist{ $entry->video_id } }, join "|",
utf8::is_utf8( $playlist->title )
? $playlist->title
: encode( "utf8", $playlist->title ),

# $playlist->title,
# Encode::_utf8_on($_);
# Encode::from_to($_, "utf8", "utf8");

# dumpstr($_);

print $_, "\n";
$total++;
}
}
print "Total $list[$_]: $total\n";
}
}

sub dumpstr {
my ($s) = @_;

# print utf8::is_utf8($s) ? "char" : "byte";

Ben Morrow

Nov 5, 2012, 4:04:00 PM

Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
>
> To make it even more confusing, the string is marked as a character
> string (the UTF8 bit is on) instead of a byte string. This is definitely
> a bug in WebService::GData.

No, it isn't. It's the programmer's responsibility to track whether a
string represents bytes or characters; the SvUTF8 flag is none of your
business.
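
A short sketch of why the flag says nothing about the contents: the same string stored both ways compares equal and has the same length. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $plain    = "caf\xe9";
my $upgraded = $plain;
utf8::upgrade($upgraded);    # changes the internal storage only

printf "flag: %d vs %d\n",
  ( utf8::is_utf8($plain)    ? 1 : 0 ),
  ( utf8::is_utf8($upgraded) ? 1 : 0 );          # flag: 0 vs 1
printf "equal: %s, lengths: %d and %d\n",
  ( $plain eq $upgraded ? "yes" : "no" ),
  length($plain), length($upgraded);             # equal: yes, lengths: 4 and 4
```

Whether those four units are meant as octets or as characters is something only the code that produced them can know.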

Ben

jid...@jidanni.org

Nov 5, 2012, 10:50:33 PM
Thanks everybody. I have elected the simple solution, s/\N{U+200E}//g;
And I now removed any binmode, etc. so as to just deal with good old
fashioned bytes. Much simpler than attempting any 'correct' solution.
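
A sketch of why this workaround presumably suffices. It assumes the titles otherwise arrive in the form Peter dumped, with one UTF-8 octet per character. The helper raw_output is hypothetical and stands in for printing to a STDOUT that has no encoding layer.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Hypothetical helper: what print writes to a handle with no encoding layer.
sub raw_output {
    my ($s) = @_;
    open my $fh, '>', \my $out or die $!;
    print {$fh} $s;
    close $fh;
    return $out;
}

# A title in the mangled form: one UTF-8 octet per character, flag on.
my $mangled = "\xe5\x8f\xb0";
utf8::upgrade($mangled);

# A title that contains a genuinely wide character.
my $wide = "Leila K \x{200E}- Got To Get";
( my $clean = $wide ) =~ s/\N{U+200E}//g;

my $written = raw_output($mangled);    # every character fits in one byte
raw_output($clean);                    # plain ASCII after the substitution
printf "%d warning(s) before the wide title\n", scalar @warnings;
raw_output($wide);                     # only this one triggers the warning
printf "%d warning(s) after the wide title\n", scalar @warnings;
```

The mangled title goes out as the octets it already holds, which are valid UTF-8, so nothing is double-encoded as long as no character above 0xFF reaches print.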

I suppose I should have realized that since there was only one wide
character warning, just one input line was causing the warning after
all... Thanks to Peter J. Holzer for waking me up to the fact!