
perl unicode utf8 and JSON::XS


Christophe Martin

Apr 7, 2013, 12:08:52 PM
to begi...@perl.org
Hello,

I'm trying to use JSON::XS. The trouble is that I have French strings
with accented characters, and the solutions I tried don't work. Also,
I'm not sure I understand the JSON::XS documentation on Unicode and UTF-8.

I use linux. The shell, terminal, and files all use utf-8.
Locales are installed, and I have LANG=fr_FR.UTF-8 and a bunch of
LC_THING=fr_FR.UTF-8 env variables. Apart from my script, everything
works perfectly well.

As you can see below, with "use utf8;" the JSON is OK and can be
decoded by other programs, but Perl strings turn to Latin-1 encoding.
With "no utf8;" or without "use utf8;", JSON strings are
«UTF-8 encoded twice» but simple Perl strings are OK.

This is Ubuntu Linux 12.04 with perl 5.14.2 and JSON::XS 2.320-1build1,
if that matters.

I must have missed something, hopefully simple. Any idea?

#! /usr/bin/perl
use utf8;
use strict;
use warnings;
use JSON::XS;

my %srchash = ( 'éléphant' => 'ça trompe' );
my $json = encode_json \%srchash;
my %dsthash = %{decode_json $json};
while ( my ($key, $value) = each %srchash ) {
    print $key, ' => ', $value, "\n";
}
print $json, "\n";
while ( my ($key, $value) = each %dsthash ) {
    print $key, ' => ', $value, "\n";
}

The results without "use utf8;" (or with "no utf8;") :
éléphant => ça trompe
{"Ã©lÃ©phant":"Ã§a trompe"}
éléphant => ça trompe

With "use utf8;", here is what I get:
�l�phant => �a trompe
{"éléphant":"ça trompe"}
�l�phant => �a trompe


Brandon McCaig

Apr 8, 2013, 1:07:13 PM
to Christophe Martin, begi...@perl.org
On Sun, Apr 07, 2013 at 06:08:52PM +0200, Christophe Martin wrote:
> Hello ,

Hello,
The documentation says that the exported decode_json and
encode_json subs expect the input and output to be UTF-8 encoded.
That is, they expect the keys and values to be binary strings
encoded in UTF-8. Then internally JSON::XS will decode and encode
as necessary.

The utf8 pragma tells Perl that the source code is UTF-8 encoded,
and that Perl should automatically decode string literals
(including hash keys) into character strings.

So I think that the use case for decode_json and encode_json is
different, though I honestly can't figure out what it would be...

This seems to work though. Note that I explicitly set the IO
encoding layer for the STDOUT handle and use the JSON::XS OO
interface with automatic UTF-8 decoding/encoding disabled (since
Perl has already handled that for us).

#! /usr/bin/perl

use utf8;
use strict;
use warnings;

use Data::Dumper;
use Encode;
use JSON::XS;

binmode \*STDOUT, ':encoding(UTF-8)';

my %src = ( 'éléphant' => 'ça trompe' );

my $jsonizer = JSON::XS->new();
my $json = $jsonizer->encode(\%src);
my %dst = %{$jsonizer->decode($json)};

print "$_\n" for keys %src, values %src;
print $json, "\n";
print "$_\n" for keys %dst, values %dst;

__END__

Since we used the utf8 pragma Perl is already decoding the keys
and values as UTF-8 and storing them appropriately internally
(the details of which shouldn't matter to us). So JSON::XS
shouldn't need to do any character encoding/decoding with the
data. It is already internally stored properly and all operations
on those values in Perl should be character-wise automatically.
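
To make the character/byte distinction concrete, here is a minimal sketch.
It assumes the script file itself is saved as UTF-8, and it uses the core
Encode module only to produce the byte form for comparison. The variable
names are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # string literals below are decoded to characters
use Encode qw(encode decode);

my $chars = 'éléphant';                 # character string
my $bytes = encode('UTF-8', $chars);    # the same text as UTF-8 octets

printf "characters: %d\n", length $chars;   # length counts characters here
printf "bytes:      %d\n", length $bytes;   # and octets here

# Decoding the octets gives back the original character string.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

Only ASCII is printed, so the sketch does not depend on any output layer.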

I'm not really sure then what purpose the utf8 option of JSON::XS
is intended to serve. It sounds to me like when enabled it expects
the input data structure to be binary (UTF-8 encoded data). I'm not
sure under what circumstances you'd want hash keys to be in binary
when you could instead just decode the data from wherever into
text strings and happily work with them from there...
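
One way to probe what the utf8 option changes is to encode the same data
with the option off and on and compare the results. The sketch below is
only an experiment, not a statement of the module author's intent. It
assumes the script is saved as UTF-8, and it falls back to the core
JSON::PP module (which offers the same utf8 option) when JSON::XS is not
installed; $class exists only for that fallback.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Assumption: JSON::XS if available, otherwise core JSON::PP.
my $class = eval { require JSON::XS; 'JSON::XS' }
         || do   { require JSON::PP; 'JSON::PP' };

my $data = { 'éléphant' => 'ça trompe' };

my $text  = $class->new->encode($data);         # utf8 off: a character string
my $bytes = $class->new->utf8->encode($data);   # utf8 on:  UTF-8 octets

printf "utf8 off: length %d\n", length $text;
printf "utf8 on:  length %d\n", length $bytes;

# Each form is decoded by an object with the matching setting.
my $from_text  = $class->new->decode($text);
my $from_bytes = $class->new->utf8->decode($bytes);
print "same data either way\n"
    if $from_text->{'éléphant'} eq $from_bytes->{'éléphant'};
```

In both cases the Perl data structure holds character strings. What differs
is the JSON text: a character string with the option off, octets ready for a
file or socket with it on.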

I wonder if maybe it could be a bug in JSON::XS that it modifies
these already Unicode-aware scalars. I am not qualified to assert
that it is so. The documentation seems clear that the behavior is
by design and I have to assume that the module author(s) know
more about Unicode in Perl than I do. :)

Regards,


--
Brandon McCaig <bamc...@gmail.com> <bamc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'


Brandon McCaig

Apr 8, 2013, 10:04:47 PM
to Christophe Martin, begi...@perl.org
On Mon, Apr 08, 2013 at 01:07:13PM -0400, Brandon McCaig wrote:
> use Data::Dumper;
> use Encode;

I should note that neither of these packages were used in the
example program that I posted and aren't required for it to run.
:) They were just remnants from debugging...

Christophe Martin

Apr 9, 2013, 3:27:51 PM
to begi...@perl.org
Hello,

Thank you Brandon McCaig and John Delacour for your answers. They both
pointed me to my mistakes.

The major problem was that I did not understand at all how perl
deals with strings.

First, I have to use "use utf8" if I want perl to understand 'é'
and not 'Ã' '©' when I write 'é' in my script. I did not realise this
because the terminal turns 'Ã' '©' back into 'é'.
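
A minimal sketch of that misreading, and of how it leads to text that is
«UTF-8 encoded twice». It is an illustration under assumptions: the explicit
Latin-1 decode stands in for what perl does with source bytes when
"use utf8" is absent, and the variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# The two octets a UTF-8 source file contains for the letter é.
my $octets = "\xC3\xA9";

# Without "use utf8", each octet is taken as one character (modelled as Latin-1).
my $misread = decode('ISO-8859-1', $octets);
# With "use utf8", the pair is decoded as a single character.
my $correct = decode('UTF-8', $octets);

printf "misread: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
printf "correct: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $correct;

# Encoding the misread string as UTF-8 is the "encoded twice" case.
my $twice = encode('UTF-8', $misread);
printf "twice:   %s\n", join ' ', map { sprintf '%02X', ord } split //, $twice;
```

The misread string has two characters (U+00C3 and U+00A9, shown as 'Ã' and
'©'), and encoding it again yields four octets where the original had two.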

Second, when I want to talk to a terminal, I have to tell perl what
encoding the terminal is expecting by using the correct IO encoding
layer for the terminal.

And finally, the encode_json and decode_json functions are designed
to be quick functions that can **export** or **import** UTF-8 data over
a binary IO channel and NOT to/from a terminal that needs some
additional encoding layer. So I should not be surprised to see 'Ã©'
instead of 'é' on a terminal.

In the end,

- "use utf8"
- binmode( \*STDOUT, ':utf8' ). And in fact binmode( $channel, ':utf8' )
  for every channel that's opened to a UTF-8 encoded text file.
- Only use encode_json and decode_json with a binary communication channel,
  since they already perform the UTF-8 transformation.
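
The three points above can be sketched together as follows. This is only an
illustration under assumptions: the script is saved as UTF-8, a temporary
file stands in for the binary channel, ':encoding(UTF-8)' is used for STDOUT
as in Brandon's example, and core JSON::PP (same encode_json/decode_json
functions) is loaded if JSON::XS is not installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # literals in this file are characters
use File::Temp qw(tempfile);

BEGIN {
    # Assumption: JSON::XS when installed, else core JSON::PP.
    eval { require JSON::XS; JSON::XS->import; 1 }
        or do { require JSON::PP; JSON::PP->import };
}

binmode STDOUT, ':encoding(UTF-8)';     # the terminal expects UTF-8

my %src = ( 'éléphant' => 'ça trompe' );

# encode_json already returns UTF-8 octets, so the file handle stays raw.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} encode_json(\%src);
close $out or die "close: $!";

open my $in, '<:raw', $path or die "open: $!";
my $bytes = do { local $/; <$in> };     # slurp the whole file as octets
close $in;

# decode_json turns the octets back into character strings.
my %dst = %{ decode_json($bytes) };

# Encoded exactly once on the way out, by the STDOUT layer.
print "$_ => $dst{$_}\n" for sort keys %dst;
```

The JSON never passes through an encoding layer, and the terminal output
never passes through encode_json, so nothing gets encoded twice.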

Thanks again,
Christophe