Utf-8


NeoBunch

Feb 24, 2005, 10:41:24 AM
to spreadsheet...@googlegroups.com
In the module's documentation it is stated that, from Perl 5.8 onward,
SS::WE will handle UTF-8 strings natively. Does this apply only to
processing internal to the module? That is, if you want to write Unicode
chars to an Excel workbook, do you still have to encode them somehow
beforehand? I ask because the module works great and all the text in the
spreadsheets is readable except for the Unicode chars. I have Perl
5.8.4 and my source data is definitely in UTF-8.
Thx in advance. Hope this group catches on.

Rob Kinyon

Feb 24, 2005, 10:44:56 AM
to spreadsheet...@googlegroups.com
According to http://www.ahinea.com/en/tech/perl-unicode-struggle.html,
you can do the following:

binmode $fh, ':utf8';

And that will put the :utf8 layer on the filehandle, so data going through it is treated as UTF-8. That may help.

jmcnamara

Feb 24, 2005, 11:31:31 AM
to spreadsheet...@googlegroups.com
Utf-8 handling should be transparent when using perl 5.8 and
Spreadsheet::WriteExcel version 2.10 onwards. You shouldn't have to
encode your strings in any way.

You can doublecheck what version you have installed as follows:

(Unix)
perl -le 'eval "require $ARGV[0]" and print $ARGV[0]->VERSION' Spreadsheet::WriteExcel

(Windows)
perl -le "eval qq(require $ARGV[0]) and print $ARGV[0]->VERSION" Spreadsheet::WriteExcel

After that try some of the unicode_*.pl examples that come with the
distro to see if they work for you. For example:

http://search.cpan.org/src/JMCNAMARA/Spreadsheet-WriteExcel-2.11/examples/unicode_polish_utf8.pl

John.
--

NeoBunch

Feb 24, 2005, 1:01:49 PM
to spreadsheet...@googlegroups.com
Thanks, John, nice to see you care about this module of yours.

Versions are fine, I tried the examples included in the distro and my
data then appears correctly in the workbook. But, in your examples, you
open the data files with an additional parameter: '<:encoding(utf8)',
if I remove that I get the same behaviour I'm getting from my other
data (so the data still needs to be 'cast', in a sense). Problem is,
I'm not getting my data from a file like in the examples (otherwise I
would just add the encoding parameter and voila!). So now I know my
problem lies somewhere else within perl. Adding 'use utf8' solved half
of it. Is there any simple way of doing the same thing
'<:encoding(utf8)' does for a file but for a string already in memory?

thx again :)

Rob Kinyon

Feb 24, 2005, 2:45:38 PM
to spreadsheet...@googlegroups.com
IO::Scalar can solve this problem. Also, you can use in-memory
filehandles, which are also new to 5.8.
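
For what it's worth, here is a minimal sketch of the in-memory filehandle idea, assuming the string holds raw UTF-8 bytes. The variable names and the sample data are made up for illustration; only open() with a scalar reference and the :encoding layer are standard Perl 5.8 features.

```perl
use strict;
use warnings;

# Assumed sample data: the raw UTF-8 bytes for "árbol" (6 bytes).
my $bytes = "\xc3\xa1rbol";

# Open the scalar as if it were a file and let the :encoding layer
# decode it while reading, just like '<:encoding(utf8)' does for a file.
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory filehandle: $!";
my $text = <$fh>;
close $fh;

printf "%d bytes in, %d characters out\n", length($bytes), length($text);
```

The string read back in $text is a character string that perl knows is Unicode, so it can be passed straight to write().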

Rob

jmcnamara

Feb 24, 2005, 6:57:50 PM
to spreadsheet...@googlegroups.com
> if I remove that I get the same behaviour I'm getting from my other
> data (so the data still needs to be 'cast', in a sense).

This is the heart of the problem. Perl doesn't know that a string of
bytes is UTF8, it needs to be told in some way. Either via the encoding
that you read the data with or via the Encode module or some other
module.

If you have a string of bytes that you know is UTF8 then you can get
perl to treat it as such with Encode::decode_utf8():

#!/usr/bin/perl -w

use strict;
use Spreadsheet::WriteExcel;
use Encode 'decode_utf8';

my $workbook = Spreadsheet::WriteExcel->new('reload.xls');
my $worksheet = $workbook->add_worksheet();

my $str1 = pack "H*", "e298ba"; # Bytes that look like utf8.
my $str2 = decode_utf8 $str1; # Now they are utf8.
my $str3 = "\x{263a}"; # Directly as utf8.

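$worksheet->write('A1', $str1); # Raw bytes, not treated as utf8.
$worksheet->write('A2', $str2); # Decoded string, written as unicode.
$worksheet->write('A3', $str3); # Same character as A2.

$workbook->close();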


__END__


Refer to the perluniintro and Encode man pages for more information.

See if this gets you a little further along. If not see if you can
create a sample program that demonstrates your situation.

John.
--

NeoBunch

Feb 25, 2005, 7:48:54 PM
to spreadsheet...@googlegroups.com
Hey, John, thx again.

I suspected my problem was something along those lines, so I started
going through the utf-8 and unicode perl docs. And hilarity ensued :).
I finally managed to solve my problem. For your module to recognize the
utf-8 strings, they have to be marked as such, with something called...
the utf-8 flag! I knew my strings were in utf-8, so a simple:

Encode::_utf8_on($_);

solves the dilemma (turns the flag on). Now, so I contribute something
back, let me share what I learned along the way:

Another way to solve the problem would've been what you proposed; in
fact, it is more elegant and robust. But there are also other ways and
it can get confusing, especially since my data lives in what is known
as the 'latin1' charset. The problem with this charset is that its
chars have 2 different byte encodings: latin1, where each char fits in
a single byte (like ASCII), and utf-8, where the non-ASCII ones (U+0080
to U+00FF) take 2 bytes. Now, with ASCII only, you never have to worry
since it's always 1 byte in either encoding, and higher code-point
chars (above U+00FF) can't be represented in latin1 at all, so you need
a Unicode encoding such as utf-8. Had my data been in, say, Greek or
Cyrillic, I would've been forced to deal with utf-8 from the very
beginning.
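
To make the byte counts concrete, here is a small sketch using Encode::encode(). The list of sample code points is just an illustration (the last one is the smiley from John's example), and the hash names are made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample code points: 'a' (ASCII), 'á' (latin1 range),
# Greek alpha, and U+263A.
my @samples = (0x61, 0xE1, 0x3B1, 0x263A);

my (%latin1_len, %utf8_len);
for my $cp (@samples) {
    my $char = chr $cp;

    # Number of bytes the character takes when encoded as utf-8.
    $utf8_len{$cp} = length(encode('UTF-8', $char));

    # latin1 can only represent code points up to U+00FF.
    $latin1_len{$cp} = $cp <= 0xFF
        ? length(encode('iso-8859-1', $char))
        : 'n/a';

    printf "U+%04X: latin1 %s, utf-8 %d\n",
        $cp, $latin1_len{$cp}, $utf8_len{$cp};
}
```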

But let me be more clear, since I think I'm not making much sense: my
data in the database is encoded in utf-8, and my web server and
application churn out pages in utf-8 (and in the browser you can see
that they are, indeed, utf-8 encoded) and everything worked fine. But
now I've just realized that my Perl application (the glue between the
DB and the Web server) was handling all of the strings as plain single
byte chars, and not as utf-8. So if I got 'árbol' from the DB, Perl
thought it was 6 chars when it's really 5 (the 'á' takes 2 bytes [hope
you can see the char, it's an 'a' with an acute accent]), and on
outputting it to the Web server, it again output 6 bytes, which, when
viewed in a browser with the correct encoding, look fine. So I never
knew I had a problem until now, when I tried to use your module, which
requires that utf-8 strings be marked as such to work correctly.
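
The 'árbol' case can be reproduced in a few lines. This is only a sketch: $from_db stands in for whatever the DB driver returns, and the bytes are typed in by hand here.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Assumed: the DB driver hands back raw utf-8 bytes, flag off.
my $from_db = "\xc3\xa1rbol";

# After decoding, perl sees characters instead of bytes.
my $decoded = decode_utf8($from_db);

printf "undecoded: length %d, utf8 flag %s\n",
    length($from_db), utf8::is_utf8($from_db) ? 'on' : 'off';
printf "decoded:   length %d, utf8 flag %s\n",
    length($decoded), utf8::is_utf8($decoded) ? 'on' : 'off';
```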

I also lost lots of time because the test case I was using to try out
had a character that MySQL for some strange reason doesn't encode
properly in utf-8 (it's the 'Á' char, which is part of latin1, so it
must be a bug in MySQL since the chars preceding and succeeding it
encode just fine, and they're all below U+00FF) and I was getting back
malformed utf-8. Had I picked another DB query for my test I would've
saved half a day of head-scratching.

I think I've gone for long enough, so I'll just sum it up:

use utf8; #only good for interpreting string literals and regexps in
the source

use encoding 'utf8'; #this one ensures all of your strings are marked
and encoded as utf-8, this would've solved the problem but this far
into the app development it breaks more things than it fixes for me. If
you're just starting your app, use this pragma and forget about the
problem.

Encode::_utf8_on($_); #ONLY sets the utf8 flag on, doesn't check if
the utf-8 encoding in the string is valid. Use only if you're positive
that the string contained in there is correct utf-8.

$_ = decode_utf8($_); #Better way to set the flag on, since it checks
the validity of the utf-8 encoding. However, it will break if the
string is not in utf-8, so a better all around way to make absolutely
sure your string is encoded and recognized by perl as utf-8 is:


($utf8 = decode_utf8($_)) or ($utf8 = decode_utf8(encode_utf8($_)));

which will flag your utf8 string as such or encode a native-perl string
into utf8.
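
One caveat about the one-liner above: as far as I can tell, decode_utf8() without a CHECK argument does not return false on malformed input (it substitutes U+FFFD), and it does return a false value for '' and '0', so the "or" branch may not fire when you expect. A stricter sketch, assuming you want invalid input rejected outright, is to pass FB_CROAK to Encode::decode() and trap the exception. The helper name decode_strict is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded character string,
# or undef if the input is not valid utf-8.
sub decode_strict {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text;    # undef when decode() died
}

my $good = decode_strict("\xc3\xa1rbol");    # valid utf-8 bytes
my $bad  = decode_strict("\xe1rbol");        # latin1 byte, invalid utf-8

print "good: ", (defined($good) ? "decoded" : "rejected"), "\n";
print "bad:  ", (defined($bad)  ? "decoded" : "rejected"), "\n";
```

What to do with rejected input (treat it as latin1, log it, skip it) depends on where the data comes from, so that part is left out.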

And

utf8::upgrade($_), which looks for the utf8 flag: if it's on, it does
nothing; if it's not, it converts each byte in the string to utf-8 (as
if it were a latin1 char) and then sets the flag. But this one's very
destructive and I can't use it in my case, since my strings ARE utf8
encoded but they don't have the flag ON, so this function re-encodes my
utf8 into utf8 thinking it's plain single-byte chars, destroying the
data.
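
The double encoding is easy to see on a single char. A short sketch, with the bytes for 'á' typed in by hand as sample data:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# utf-8 bytes for 'á' (U+00E1), utf8 flag off.
my $bytes = "\xc3\xa1";

# utf8::upgrade treats each byte as a separate latin1 character.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# decode_utf8 interprets the two bytes as one encoded character.
my $decoded = decode_utf8($bytes);

printf "upgraded: %d chars, %d bytes when written out as utf-8\n",
    length($upgraded), length(encode_utf8($upgraded));
printf "decoded:  %d chars, %d bytes when written out as utf-8\n",
    length($decoded), length(encode_utf8($decoded));
```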

If you made it this far, you're probably more confused now. Follow
John's advice and read up on the unicode docs, but be warned that there
are a few inconsistencies here and there, which is why it took me so
long to get it down. Good luck.

Thx again
I promise never to do that again :)
