Encode vs. JSON

David E. Wheeler

Jul 16, 2014, 6:03:14 PM

to Perl5 Porters

Porters,

I have a script:

use v5.10;
use warnings;
use JSON;
use Encode qw(encode_utf8 decode_utf8);

my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
my $parser = JSON->new->utf8;

my $data = $parser->decode($json);
say encode_utf8 $data->{FFONTS};

On Perl 5.12 and earlier, this dies:

malformed UTF-8 character in JSON string, at character offset 23 (before "\x{ffff}ALIC"}")

It does not die on 5.14, which I assume is due to the addition of Unicode 6 support. But oddly, while JSON complains on 5.12 and earlier, Encode does not:

use v5.10;
use warnings;
use JSON;
use Encode qw(encode_utf8 decode_utf8);

my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
$json = decode_utf8 $json, Encode::FB_CROAK;

my $parser = JSON->new;

my $data = $parser->decode($json);
say encode_utf8 $data->{FFONTS};

This dies with the same error from JSON.pm, but note that the call to decode_utf8() worked. I’m left wondering why JSON and Encode seem to disagree on the validity of those bytes as UTF-8 in Perl 5.12. Ideas?

Thanks,

David

Aristotle Pagaltzis

Jul 17, 2014, 11:00:55 PM

to David E. Wheeler, Perl5 Porters

Hi David,

* David E. Wheeler <da...@justatheory.com> [2014-07-17 00:05]:

> I have a script:
>
> use v5.10;
> use warnings;
> use JSON;
> use Encode qw(encode_utf8 decode_utf8);
>
> my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
> my $parser = JSON->new->utf8;
>
> my $data = $parser->decode($json);
> say encode_utf8 $data->{FFONTS};
>
> On Perl 5.12 and earlier, this dies:
>
> malformed UTF-8 character in JSON string, at character offset 23 (before "\x{ffff}ALIC"}")
>
> It does not die on 5.14, which I assume is due to the addition of
> Unicode 6 support.

why do you assume that? As far as I can tell, Unicode 6 has no changes
of any kind WRT U+FFFF.

> But oddly, while JSON complains on 5.12 and earlier, Encode does not:
>
> use v5.10;
> use warnings;
> use JSON;
> use Encode qw(encode_utf8 decode_utf8);
>
> my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
> $json = decode_utf8 $json, Encode::FB_CROAK;
>
> my $parser = JSON->new;
>
> my $data = $parser->decode($json);
> say encode_utf8 $data->{FFONTS};
>
> This dies with the same error from JSON.pm, but note that the call to
> decode_utf8() worked. I’m left wondering why JSON and Encode seem to
> disagree on the validity of those bytes as UTF-8 in Perl 5.12. Ideas?

Sounds to me like it’s the behaviour of JSON that changes between 5.12
and 5.14 rather than that of Encode?

What I can say is that U+FFFF is a non-character, but EF BF BF is the
correct encoding of that codepoint. Using decode_utf8(...) is short for
decode("utf8", ...), which is completely permissive. As long as it can
decode the octet sequence according to the UTF-8 encoding, it will not
complain. In contrast, if you do decode("UTF-8", ...) then you will get
charset checking too. And *that* *will* reject your attempt to smuggle
a U+FFFF into the string.

So that’s why Encode behaves as it does.
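
A minimal sketch of that difference, assuming only the Encode module; the helper name `try_decode` is made up for illustration. It prints whatever each decoder does rather than asserting an outcome, since the strict result may depend on the perl and Encode versions installed:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets under discussion: the UTF-8 encoding of the noncharacter U+FFFF.
my $octets = "\xEF\xBF\xBF";

# Hypothetical helper: try one encoding name and report what happened.
# decode() may modify its input when a CHECK argument is given, so each
# attempt works on a fresh copy.
sub try_decode {
    my ($name) = @_;
    my $copy  = $octets;
    my $chars = eval { decode($name, $copy, Encode::FB_CROAK) };
    if (defined $chars) {
        return sprintf '%-6s accepted, got U+%04X', $name, ord $chars;
    }
    (my $err = $@) =~ s/\s+\z//;    # trim the trailing newline from the error
    return sprintf '%-6s rejected: %s', $name, $err;
}

print try_decode('utf8'),  "\n";    # lax: structural check only
print try_decode('UTF-8'), "\n";    # strict: also applies interchange rules
```

Running it under each perl in question shows directly which decoder rejects the sequence.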

Why does JSON go from rejecting to accepting the string if you go from
5.12 to 5.14? That, I have no idea about. (Or maybe it goes from one
to the other based on the version of JSON; you haven’t specified whether
you have the same version of it installed in your 5.12 vs 5.14 perls.)

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

David E. Wheeler

Jul 18, 2014, 1:19:22 AM

to Aristotle Pagaltzis, Perl5 Porters

On Jul 17, 2014, at 8:00 PM, Aristotle Pagaltzis <paga...@gmx.de> wrote:

> Hi David,

Hey Aristotle, many thanks for your reply. Super helpful.

>> It does not die on 5.14, which I assume is due to the addition of
>> Unicode 6 support.
>
> why do you assume that? As far as I can tell, Unicode 6 has no changes
> of any kind WRT U+FFFF.

It was a guess.

> Sounds to me like it’s the behaviour of JSON that changes between 5.12
> and 5.14 rather than that of Encode?

Yes.

> What I can say is that U+FFFF is a non-character, but EF BF BF is the
> correct encoding of that codepoint. Using decode_utf8(...) is short for
> decode("utf8", ...), which is completely permissive. As long as it can
> decode the octet sequence according to the UTF-8 encoding, it will not
> complain. In contrast, if you do decode("UTF-8", ...) then you will get
> charset checking too. And *that* *will* reject your attempt to smuggle
> a U+FFFF into the string.

Ah, yes, quite right. I keep forgetting that utf8 is so permissive.

> So that’s why Encode behaves as it does.

So this data came from a Java app, which serialized the string "HOLIDAYBOLDI\xEF\xBF\xBFALIC" into JSON. This tells me that our Java app needs to be a little more careful about what it considers UTF-8, and perhaps replace bogus characters/bytes. But I am unable to get it to choke on \uFFFF at all on Java 6 or 7. This does not throw an exception:

"\uFFFF".getBytes("UTF-8");

I Googled around a bit, and found this SO answer:

http://stackoverflow.com/a/16619933/79202

Which suggests that, according to [Corrigendum 9](http://www.unicode.org/versions/corrigendum9.html), reserved non-characters now *are* allowed to appear in a UTF-8 string. Which makes me think I will never be able to get the Java server to clean up its act. Should Perl, Encode, and JSON relax things a bit with regard to these characters, then?

> Why does JSON go from rejecting to accepting the string if you go from
> 5.12 to 5.14? That, I have no idea about. (Or maybe it goes from one
> to the other based on the version of JSON; you haven’t specified whether
> you have the same version of it installed in your 5.12 vs 5.14 perls.)

I used JSON 2.90 and JSON::XS 3.01 in all my tests.

Best,

David

David E. Wheeler

Jul 18, 2014, 2:37:03 AM

to Aristotle Pagaltzis, Perl5 Porters

On Jul 17, 2014, at 10:19 PM, David E. Wheeler <da...@justatheory.com> wrote:

> Which suggests that, according to [Corrigendum 9](http://www.unicode.org/versions/corrigendum9.html), reserved non-characters now *are* allowed to appear in a UTF-8 string. Which makes me think I will never be able to get the Java server to clean up its act. Should Perl, Encode, and JSON relax things a bit with regard to these characters, then?

Actually, now that I think about it, it seems that JSON on Perl 5.14 and higher has already relaxed that distinction. It’s only Encode that is still strict about non-characters.

Best,

David

Aristotle Pagaltzis

Jul 18, 2014, 3:56:43 AM

to David E. Wheeler, Perl5 Porters

Hi David,

* David E. Wheeler <da...@justatheory.com> [2014-07-18 08:40]:

> On Jul 17, 2014, at 10:19 PM, David E. Wheeler <da...@justatheory.com> wrote:

> > Which suggests that, according to [Corrigendum
> > 9](http://www.unicode.org/versions/corrigendum9.html), reserved
> > non-characters now *are* allowed to appear in a UTF-8 string. Which
> > makes me think I will never be able to get the Java server to clean
> > up its act. Should Perl, Encode, and JSON relax things a bit with
> > regard to these characters, then?
>

> Actually, now that I think about it, it seems that JSON on Perl 5.14
> and higher has already relaxed that distinction. It’s only Encode that
> is still strict about non-characters.

there is a ticket about that:
https://rt.perl.org/Public/Bug/Display.html?id=121937

* David E. Wheeler <da...@justatheory.com> [2014-07-18 07:20]:

So that leaves the question open as it was: why does JSON.pm exhibit one
behaviour under 5.12 and another under 5.14?

David E. Wheeler

Jul 20, 2014, 12:58:51 AM

to Aristotle Pagaltzis, Perl5 Porters

On Jul 18, 2014, at 12:56 AM, Aristotle Pagaltzis <paga...@gmx.de> wrote:

> there is a ticket about that:
> https://rt.perl.org/Public/Bug/Display.html?id=121937

Ah, interesting. I had not run into that warning. What I ran into with Encode I now think should be changed:

perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.

In fact it *does* map to Unicode, IIUC Corrigendum 9 correctly. I’ll file a bug with Dan.

> So that leaves the question open as it was: why does JSON.pm exhibit one
> behaviour under 5.12 and another under 5.14?

Yes, very curious.

Best,

David

David E. Wheeler

Jul 21, 2014, 2:22:23 PM

to Aristotle Pagaltzis, Perl5 Porters

On Jul 19, 2014, at 9:58 PM, David E. Wheeler <da...@justatheory.com> wrote:

>> there is a ticket about that:
>> https://rt.perl.org/Public/Bug/Display.html?id=121937
>
> Ah, interesting. I had not run into that warning. What I ran into with Encode I now think should be changed:
>
> perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
> utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.
>
> In fact it *does* map to Unicode, IIUC Corrigendum 9 correctly. I’ll file a bug with Dan.

I did so, here:

https://rt.cpan.org/Ticket/Display.html?id=97358

Dan replied to report that it’s UTF8_DISALLOW_ILLEGAL_INTERCHANGE from the Perl core that’s at fault:

> If it were a bug, it belongs to perl core because the strictness of UTF8 is #defined in the value of UTF8_DISALLOW_ILLEGAL_INTERCHANGE which is defined in perl core:
>
> http://perldoc.perl.org/perlapi.html#Unicode-Support
>
> In other words, Encode faithfully believes perl core with that respect. And I want to leave Encode that way. If it is to be fixed, it should be fixed by redefining UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR in perl core.

ISTM that, given the change in Corrigendum 9, UTF8_DISALLOW_ILLEGAL_INTERCHANGE should exclude UTF8_DISALLOW_NONCHAR.

Is this part of the same issue as that described in RT-97358? Or should I start a new issue?

Best,

David

Karl Williamson

Jul 22, 2014, 11:55:07 PM

to David E. Wheeler, Aristotle Pagaltzis, Perl5 Porters

On 07/21/2014 12:22 PM, David E. Wheeler wrote:
> On Jul 19, 2014, at 9:58 PM, David E. Wheeler <da...@justatheory.com> wrote:
>
>>> there is a ticket about that:
>>> https://rt.perl.org/Public/Bug/Display.html?id=121937
>>
>> Ah, interesting. I had not run into that warning. What I ran into with Encode I now think should be changed:
>>
>> perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
>> utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.
>>
>> In fact it *does* map to Unicode, IIUC Corrigendum 9 correctly. I’ll file a bug with Dan.
>
> I did so, here:
>
> https://rt.cpan.org/Ticket/Display.html?id=97358
>
> Dan replied to report that it’s UTF8_DISALLOW_ILLEGAL_INTERCHANGE from the Perl core that’s at fault:
>
>> If it were a bug, it belongs to perl core because the strictness of UTF8 is #defined in the value of UTF8_DISALLOW_ILLEGAL_INTERCHANGE which is defined in perl core:
>>
>> http://perldoc.perl.org/perlapi.html#Unicode-Support
>>
>> In other words, Encode faithfully believes perl core with that respect. And I want to leave Encode that way. If it is to be fixed, it should be fixed by redefining UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR in perl core.
>
>
> ISTM that, given the change in Corrigendum 9, UTF8_DISALLOW_ILLEGAL_INTERCHANGE should exclude UTF8_DISALLOW_NONCHAR.
>
> Is this part of the same issue as that described in RT-97358? Or should I start a new issue?
>
> Best,
>
> David
>

We have a backwards compatibility problem here. Corrigendum 9 is
controversial, and the wording has not been incorporated into the text
of Unicode 7.0 because that hasn't been published yet (the data has, but
not the text of the standard).

Noncharacters are still supposed to be used only for internal purposes.
The genesis of #9 was that ICU and CLDR were having trouble with
off-the-shelf editors and version control systems rejecting their code
that used them legitimately (though it appears that there are some poor
design decisions involving their use).

I sent a query about things to the Unicode mailing list some months ago,
and it stirred up quite a bit of resentment about the #9 decision. It
was made without public input, and during a single meeting, so there
wasn't time to consider all the ramifications.

One of my points was that we have a gatekeeper that has kept
non-characters out of input. Code that uses non-characters internally
has relied on that gatekeeper to prevent conflicts. If we change the
gatekeeper to allow noncharacters, there is a potential security hole.
Even the people on the Unicode list that were the promulgators of the
change given by #9 agree that any existing code that excludes
noncharacters should not be changed to allow them.

David E. Wheeler

Jul 23, 2014, 1:38:40 AM

to Karl Williamson, Aristotle Pagaltzis, Perl5 Porters

On Jul 22, 2014, at 8:55 PM, Karl Williamson <pub...@khwilliamson.com> wrote:

> We have a backwards compatibility problem here. Corrigendum 9 is controversial, and the wording has not been incorporated into the text of Unicode 7.0 because that hasn't been published yet (the data has, but not the text of the standard).
>
> Noncharacters are still supposed to be used only for internal purposes. The genesis of #9 was that ICU and CLDR were having trouble with off-the-shelf editors and version control systems rejecting their code that used them legitimately (though it appears that there are some poor design decisions involving their use).
>
> I sent a query about things to the Unicode mailing list some months ago, and it stirred up quite a bit of resentment about the #9 decision. It was made without public input, and during a single meeting, so there wasn't time to consider all the ramifications.

Huh. So much tempest!

> One of my points was that we have a gatekeeper that has kept non-characters out of input. Code that uses non-characters internally has relied on that gatekeeper to prevent conflicts. If we change the gatekeeper to allow noncharacters, there is a potential security hole. Even the people on the Unicode list that were the promulgators of the change given by #9 agree that any existing code that excludes noncharacters should not be changed to allow them.

Well, for now, for my purposes, I put this into our code:

use constant PERL514 => $] >= 5.014;
# ... later in that same file…
unless (PERL514) {
    # Replace noncharacters with the UNICODE REPLACEMENT character.
    $json =~ s/\xEF(?:\xBF[\xBF\xBE]|\xB7[\x90-\xAF])/\xEF\xBF\xBD/g;
}

Which fixes the immediate issue for us on 5.10.1 (Thanks RedHat!) and should allow it to keep working once we get on a more modern Perl. This is because JSON(::XS)? on 5.14 and higher is okay with noncharacters, even if `decode("UTF-8", $json)` isn’t.
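
As a self-contained illustration of that substitution, here is a small sketch. The helper name `scrub_bmp_noncharacters` is hypothetical, and the sample string is the one from the start of this thread. Note that the pattern works on raw octets before any decoding, and it only covers the noncharacters in the BMP (U+FFFE, U+FFFF, and U+FDD0..U+FDEF), not the ones at the end of the supplementary planes:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution shown above. It expects
# undecoded UTF-8 octets and replaces each BMP noncharacter with the
# octets of U+FFFD REPLACEMENT CHARACTER (EF BF BD).
sub scrub_bmp_noncharacters {
    my ($octets) = @_;
    $octets =~ s/\xEF(?:\xBF[\xBF\xBE]|\xB7[\x90-\xAF])/\xEF\xBF\xBD/g;
    return $octets;
}

my $json  = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
my $clean = scrub_bmp_noncharacters($json);

# Show the three octets at offset 23, where JSON reported the problem.
print join(' ', map { sprintf '%02X', ord } split //, substr($clean, 23, 3)), "\n";
# prints "EF BF BD"
```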

As for where the “EF BF BF” is coming from: JavaScript and Flash running in a browser. Cool, right? FWIW, neither Java, JavaScript, nor Postgres complain about this noncharacter. I guess they tend to behave more like `decode_utf8`.

As I’ve solved my immediate problem, I’m fine to let you guys decide whether or not to change UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR. Do you want a ticket to track the issue, or is https://rt.perl.org/Public/Bug/Display.html?id=121937 sufficient (I can add a comment there if you’d like, access controls allowing).

Thanks for the detailed reply.

Best,

David

David E. Wheeler

Sep 18, 2014, 8:28:07 PM

to Karl Williamson, Aristotle Pagaltzis, Perl5 Porters

On Jul 22, 2014, at 10:38 PM, David E. Wheeler <da...@justatheory.com> wrote:

> As I’ve solved my immediate problem, I’m fine to let you guys decide whether or not to change UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR. Do you want a ticket to track the issue, or is https://rt.perl.org/Public/Bug/Display.html?id=121937 sufficient (I can add a comment there if you’d like, access controls allowing).

Karl, what say you?

Best,

David

Karl Williamson

Sep 18, 2014, 8:59:44 PM

to David E. Wheeler, Aristotle Pagaltzis, Perl5 Porters

On 09/18/2014 06:28 PM, David E. Wheeler wrote:
> On Jul 22, 2014, at 10:38 PM, David E. Wheeler <da...@justatheory.com> wrote:
>
>> As I’ve solved my immediate problem, I’m fine to let you guys decide whether or not to change UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR. Do you want a ticket to track the issue, or is https://rt.perl.org/Public/Bug/Display.html?id=121937 sufficient (I can add a comment there if you’d like, access controls allowing).
>
> Karl, what say you?
>
> Best,
>
> David
>

Background: It turns out that the Corrigendum #9 is controversial in
the Unicode community. It was done during the course of a single
meeting, and not subjected to the usual public review. The wording of
the Standard in regards to this has not been finalized.

We cannot just change this. It would open up security holes.
Applications likely have been written assuming Non-characters will not
be in the input, and thus are usable as sentinels, without fear of
encountering one from user-data. If we were to make this change that
would no longer be true, and a long-standing module could silently be
exposed to an attack.
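
To make the concern concrete, here is a purely hypothetical sketch (not taken from any real module) of code that uses U+FFFF as an internal field separator on the assumption that decoded input can never contain it:

```perl
use strict;
use warnings;

# Hypothetical internal format: fields are joined with U+FFFF, relying on
# the decoder to have kept that code point out of all user-supplied data.
my $SENTINEL = "\x{FFFF}";

sub pack_fields   { return join $SENTINEL, @_ }
sub unpack_fields { return split /\Q$SENTINEL\E/, $_[0], -1 }

# With the gatekeeper in place, two fields go in and two come out.
my @expected = unpack_fields(pack_fields('alice', 'role=user'));

# If a noncharacter gets through in user data, an extra field appears.
my @injected = unpack_fields(
    pack_fields("alice${SENTINEL}role=admin", 'role=user')
);

printf "clean input:    %d fields\n", scalar @expected;
printf "injected input: %d fields\n", scalar @injected;
```

The second call yields three fields instead of two, which is the kind of silent breakage that relaxing the check could expose.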

The feedback from Unicode on this was unanimous, even from the people
who were the ones who pushed for #9. If you have an existing library
(as essentially we do) that excluded non-chars, you have to continue to
exclude them to prevent security holes from opening up.

The way out of this is to have some API to tell Encode that
non-characters are acceptable.

David E. Wheeler

Sep 19, 2014, 12:08:26 PM

to Karl Williamson, Aristotle Pagaltzis, Perl5 Porters, Dan Kogai

On Sep 18, 2014, at 5:59 PM, Karl Williamson <pub...@khwilliamson.com> wrote:

> The way out of this is to have some API to tell Encode that non-characters are acceptable.

Encode-only? Is there a way to do it with the IO layers? Or is that just Encode, too?

Dan, should we re-open this bug to request an interface for telling UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR?

https://rt.cpan.org/Ticket/Display.html?id=97358#txn-1388686

Thanks,

David

Karl Williamson

Sep 19, 2014, 12:23:19 PM

to David E. Wheeler, Aristotle Pagaltzis, Perl5 Porters, Dan Kogai

On 09/19/2014 10:08 AM, David E. Wheeler wrote:
> Encode-only? Is there a way to do it with the IO layers? Or is that just Encode, too?

I keep hoping we will soon get a new :utf8 layer that will allow this
sort of thing.