
Error in Handling Unicode(UTF16-LE) File & String


iaminsik

May 6, 2008, 4:00:50 AM

In most cases, I converted utf-16le files into utf-8 encoding.
But, I want to handle utf-16le files directly.

My first source is "read a line from utf-16le file and write it in
utf-16le encoding".
It works well.

==========================================================
use utf8;
use Encode;

open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
binmode $infile;
open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
binmode $outfile;

while ($line = <$infile>)
{
print $outfile $line;
}

close($infile);
close($outfile);
==========================================================

My second source reads one line, splits it into an array, and prints the
array elements one per line in utf-16le encoding.
It seemed to work at first, but some characters came out broken.
After a long web search, I found that Unicode::String could solve
this problem.

==========================================================
use utf8;
use Encode;

$\ = "\n";

open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
binmode $infile;
open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
binmode $outfile;

while ($line = <$infile>)
{
chomp($line);
@words = split(/[ ]+/, $line);
foreach $word (@words)
{
print $outfile $word;
}
}

close($infile);
close($outfile);
==========================================================

Using Unicode::String, I wrote the third source, but it still doesn't
work.
So reading seems to be OK, but the split function isn't.
Is there any solution?
==========================================================
use utf8;
use Encode;
use Unicode::String;
Unicode::String->stringify_as('utf16');

$\ = "\n";

open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
binmode $infile;
open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
binmode $outfile;

while ($line = <$infile>)
{
chomp($line);
$sep = new Unicode::String ("[ ]+");
@words = split($sep, $line);
foreach $word (@words)
{
print $outfile $word;
}
}

close($infile);
close($outfile);
==========================================================

Best Regards.
Remi

Ben Bullock

May 6, 2008, 6:44:09 AM

On Tue, 06 May 2008 01:00:50 -0700, iaminsik wrote:

> In most cases, I converted utf-16le files into utf-8 encoding. But, I
> want to handle utf-16le files directly.
>
> My first source is "read a line from utf-16le file and write it in
> utf-16le encoding".
> It works well.

No it doesn't. Your problems are all in the first file.

> open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
> binmode $infile;
> open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
> binmode $outfile;

Do you know what binmode does? You'd better have another look at the
manual (perldoc -f binmode). The binmode statements here switch OFF all
the :raw:encoding(UTF... stuff you'd put in the previous lines, which
explains all the other problems you had.

To demonstrate, try this:

#!/usr/local/bin/perl
use warnings;
use strict;
use utf8;
use Encode;
binmode STDOUT, ":utf8";
my $utf8 = "モンスター 自惚れ";
for (qw/file1 file2/) {
open (my $outfile, ">:raw:encoding(UTF-16LE):crlf", "$_.dat") or die $!;
binmode $outfile if /1/; # do what you did for file1 only
print $outfile $utf8;
close $outfile or die $!;
open (my $infile, "<:encoding(UTF-16LE):crlf", "$_.dat");
while (my $line = <$infile>)
{
print "$_: $line\n";
}
close($infile) or die $!;
}


The reason your code appeared to work is because you never did anything
with the data. It was actually just reading and writing it as bytes
without any knowledge of the encoding. As soon as you tried to manipulate
the data, the problem which had been there all along became visible.
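
One way to check this is to ask Perl which layers are on a handle before and after a bare binmode. The following is only a sketch, not part of the original script: it writes to a temporary file from File::Temp instead of unicodefile.out, and it uses PerlIO::get_layers to list the layers. The encoding(UTF-16LE) entry should be present in the first list and gone from the second.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file standing in for unicodefile.out (an assumption for this demo).
my (undef, $path) = tempfile(UNLINK => 1);

open(my $out, ">:raw:encoding(UTF-16LE)", $path) or die "open: $!";
my @before = PerlIO::get_layers($out);   # layers as requested in open

binmode $out;                            # no layer argument, as in the question
my @after = PerlIO::get_layers($out);    # layers left after the bare binmode

close($out) or die "close: $!";

print "before binmode: @before\n";
print "after binmode:  @after\n";
```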

P.S. use warnings; use strict; & check the values of open and close as
above.

Ben Bullock

May 6, 2008, 6:57:39 AM

On Tue, 06 May 2008 10:44:09 +0000, Ben Bullock wrote:

> open (my $infile, "<:encoding(UTF-16LE):crlf", "$_.dat");

> P.S. use warnings; use strict; & check the values of open and close as
> above.

Oops!

iaminsik

May 7, 2008, 1:11:20 AM

Thanks a lot, Ben!
After reading your reply, I understand 'binmode' a little better.

I modified my source in two ways.

1. First Source
========================================================================
no warnings;
use utf8;

$\ = "\n";

open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
binmode $outfile;

while ($line = <$infile>)
{
chomp($line);
@words = split(/[ ]+/, $line);
foreach $word (@words)
{
print $outfile $word;
}
}

close($infile);
close($outfile);
========================================================================
The first source generates 'wide character warnings',
and saves outfile in utf8 format, weirdly.

2. Second Source
========================================================================
no warnings;
use utf8;

$\ = "\n";

open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
# binmode $outfile;

while ($line = <$infile>)
{
chomp($line);
@words = split(/[ ]+/, $line);
foreach $word (@words)
{
print $outfile $word;
}
}

close($infile);
close($outfile);
========================================================================
I made 'binmode $outfile;' as a comment line,
and it saves outfile in UTF-16LE format I wanted.

3. Several Questions
I, a newbie in Perl programming language, couldn't understand two
parts in your codes.
========================================================================
for (qw/file1 file2/) { <===== what it means? it's a short expression
for loop?
binmode $outfile if /1/; <===== what /l/ means?
========================================================================

Best Regards for your help.
Remi.

Ben Bullock

May 7, 2008, 1:46:46 AM

iaminsik <iami...@gmail.com> wrote:

> The first source generates 'wide character warnings',
> and saves outfile in utf8 format, weirdly.

It's not weird. The input is read using the encoding you specified,
then the binmode switches off the encoding layer on the output handle,
so the text is printed in Perl's internal format, which is UTF-8. That
generates wide character warnings because you haven't explicitly set
the mode of the output to anything. (The "use utf8;" line only affects
how Perl reads your source code, not file I/O.) Use

binmode $outfile, ":utf8";

to switch those wide character warnings off.
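
To see the effect in isolation, here is a small sketch (not from the original scripts) that prints one wide character through a handle whose encoding layer has been stripped by a bare binmode, then reads the file back as raw bytes. The temporary file and the choice of U+30E2 are assumptions made for the demonstration. The byte count should match the UTF-8 encoding of the character rather than the two bytes UTF-16LE would use, and a "Wide character" warning should be captured.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my (undef, $path) = tempfile(UNLINK => 1);

# Collect warnings instead of letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

open(my $out, ">:raw:encoding(UTF-16LE)", $path) or die "open: $!";
binmode $out;                  # strips the encoding layer again
print {$out} "\x{30E2}";       # one character above 0xFF (KATAKANA LETTER MO)
close($out) or die "close: $!";

# Read the file back without any decoding to inspect the bytes on disk.
open(my $in, "<:raw", $path) or die "open: $!";
my $bytes = do { local $/; <$in> };
close($in) or die "close: $!";

printf "%d byte(s) written: %s\n", length $bytes,
    join " ", map { sprintf "%02X", ord } split //, $bytes;
print "warning: $_" for @warnings;
```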

> I made 'binmode $outfile;' as a comment line,
> and it saves outfile in UTF-16LE format I wanted.

Good news.
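
For reference, the working version could be tidied up along the lines suggested earlier in the thread (strict, warnings, checked open and close). This is only a sketch: the helper name split_words_utf16le is made up for illustration, the layer strings are the ones from the question, and the demo at the bottom uses temporary files rather than unicodefile.dat and unicodefile.out.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write each space-separated word of a UTF-16LE
# file on its own line of a UTF-16LE output file.
sub split_words_utf16le {
    my ($in_path, $out_path) = @_;
    open(my $in,  "<:encoding(UTF-16LE):crlf",     $in_path)  or die "$in_path: $!";
    open(my $out, ">:raw:encoding(UTF-16LE):crlf", $out_path) or die "$out_path: $!";
    # No bare binmode here: it would remove the encoding layers again.
    while (my $line = <$in>) {
        chomp $line;
        print {$out} "$_\n" for split /[ ]+/, $line;
    }
    close($in)  or die "$in_path: $!";
    close($out) or die "$out_path: $!";
    return;
}

# Demo: build a small UTF-16LE input file, run the helper, read the result.
my (undef, $in_path)  = tempfile(UNLINK => 1);
my (undef, $out_path) = tempfile(UNLINK => 1);

open(my $fh, ">:raw:encoding(UTF-16LE)", $in_path) or die "open: $!";
print {$fh} "\x{30E2}\x{30F3} \x{81EA}\x{60DA}\n";   # two words, one space
close($fh) or die "close: $!";

split_words_utf16le($in_path, $out_path);

open($fh, "<:encoding(UTF-16LE):crlf", $out_path) or die "open: $!";
chomp(my @words = <$fh>);
close($fh) or die "close: $!";

print scalar(@words), " word(s) read back\n";
```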

> 3. Several Questions
> I, a newbie in Perl programming language, couldn't understand two
> parts in your codes.
> ========================================================================
> for (qw/file1 file2/) { <===== what it means? it's a short expression
> for loop?

This sets $_ to "file1" then "file2". qw/a b/ equals ('a', 'b').

> binmode $outfile if /1/; <===== what /l/ means?

It's not an l it's a 1. "if /1/" has the effect of saying 'if $_ is
"file1"'. The /1/ detects the character '1' in the name. Try
experimenting with the code to understand what it does.
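
A stripped-down version of those two lines, with the file handling removed, might make the behaviour easier to see. The @matched array is introduced here only for illustration.

```perl
use strict;
use warnings;

my @matched;
for (qw/file1 file2/) {          # same list as ('file1', 'file2')
    push @matched, $_ if /1/;    # /1/ is short for $_ =~ /1/
}
print "matched: @matched\n";     # prints "matched: file1"
```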

iaminsik

May 7, 2008, 11:38:05 PM

On May 7, 2:46 PM, benkasminbull...@gmail.com (Ben Bullock) wrote:

Your comment helped me a lot.
Thanks, Ben!

Best Regards,
Remi.
