[perl #121937] Unicode non-characters *are* valid for interchange

Jarkko Hietaniemi

May 21, 2014, 9:55:23 AM

to bugs-bi...@rt.perl.org

# New Ticket Created by Jarkko Hietaniemi
# Please include the string: [perl #121937]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/Ticket/Display.html?id=121937 >

It seems that Perl is lagging on the handling for Unicode
"non-characters" [1]: they are these days valid for interchange:

http://www.unicode.org/versions/corrigendum9.html

In other words, they should be handled much like PUA (private use area)
characters [2]: passed through as-is.

How we are currently doing it wrong:

(a) ./perl -CO -we 'print chr(0xFFFF)'

Unicode non-character U+FFFF is illegal for open interchange at -e line 1.
�%

(Somewhat strangely, the -CO is required for the warning to appear.)

We shouldn't warn.

It is possible we still could warn somehow, to alert users about the
special nature of "non-characters" (a very unfortunate name), but they
are definitely legal characters, and they can be interchanged. (They
are not *intended* for interchange, but that is quite different from
"forbidden".)
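
For reference, the Unicode Standard designates 66 code points as non-characters: U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes (U+nFFFE and U+nFFFF). A minimal sketch of that membership test follows; the helper name is_nonchar is made up for illustration and is not part of Perl core or Encode.

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): true if $cp is one of the
# 66 Unicode non-characters.
sub is_nonchar {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;          # outside the Unicode range
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;     # contiguous block of 32
    return ( ( $cp & 0xFFFE ) == 0xFFFE ) ? 1 : 0;  # U+nFFFE and U+nFFFF in every plane
}

for my $cp ( 0xFFFD, 0xFFFE, 0xFFFF, 0xFDD0, 0xE000, 0x10FFFF ) {
    printf "U+%04X: %s\n", $cp,
        is_nonchar($cp) ? "non-character" : "ordinary code point";
}
```

Note that U+E000 (a private-use code point) and U+FFFD (the replacement character) are not non-characters under this test.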

(b) In Encode, the "utf8" lets the non-chars through, but the strict
"UTF-8" mangles them to the Unicode REPLACEMENT CHARACTER U+FFFD:

./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("utf8",
"\xEF\xBF\xBF"))'
SV = PV(0x7ffba18041f0) at 0x7ffba1803438
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x7ffba143e6c0 "\357\277\277"\0 [UTF8 "\x{ffff}"]
CUR = 3
LEN = 16
./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("UTF-8",
"\xEF\xBF\xBF"))'
SV = PV(0x7ff34104aa50) at 0x7ff341031f28
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x7ff340d022e0 "\357\277\275"\0 [UTF8 "\x{fffd}"]
CUR = 3
LEN = 16

We shouldn't mangle.
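
The comparison in (b) can also be made without Devel::Peek. The sketch below decodes the same three bytes with the lax "utf8" encoding and the strict "UTF-8" encoding and prints the resulting code point. It assumes only the documented Encode::decode interface; the strict result shown in the report above is U+FFFD, but other Encode releases may behave differently, so no output is claimed here.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xEF\xBF\xBF";    # UTF-8 encoding of U+FFFF

# "utf8" is Perl's lax encoding; "UTF-8" is the strict one.
# Without a CHECK argument, decode() leaves $bytes unmodified.
my $lax    = decode( "utf8",  $bytes );
my $strict = decode( "UTF-8", $bytes );

printf "lax    (utf8):  U+%04X\n", ord $lax;
printf "strict (UTF-8): U+%04X\n", ord $strict;
```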

---

[1] http://www.unicode.org/faq/private_use.html#nonchar1
[2] http://www.unicode.org/faq/private_use.html

Jarkko Hietaniemi

May 21, 2014, 10:30:04 AM

to perl5-...@perl.org

On Wednesday-201405-21, 9:55, Jarkko Hietaniemi (via RT) wrote:
>
> How we are currently doing it wrong:

Should've said:

"Currently known wrongnesses include, but are probably not limited to"

Tony Cook via RT

May 21, 2014, 11:42:16 PM

to perl5-...@perl.org

On Wed May 21 06:55:23 2014, jhi wrote:
> It seems that Perl is lagging on the handling for Unicode
> "non-characters" [1]: they are these days valid for interchange:
>
> http://www.unicode.org/versions/corrigendum9.html
>
> In other words, they should be handled much like PUA (private use area)
> characters [2]: passed through as-is.
>

> How we are currently doing it wrong:
>

This looks like a duplicate of

https://rt.perl.org/Ticket/Display.html?id=121226

Tony

---
via perlbug: queue: perl5 status: new
https://rt.perl.org/Ticket/Display.html?id=121937

Jarkko Hietaniemi

May 22, 2014, 8:32:18 AM

to perlbug-...@perl.org, perl5-...@perl.org

On Wednesday-201405-21, 23:42, Tony Cook via RT wrote:
> This looks like a duplicate of
>
> https://rt.perl.org/Ticket/Display.html?id=121226

Yup, the same issue.

FWIW, I started poking at this.

Karl Williamson via RT

May 29, 2014, 3:20:10 PM

to perl5-...@perl.org

I have now merged these two tickets. I've been thinking about and doing some research in the Unicode standard about this, and am having trouble with the idea that we should now just change to accept non-characters without warning.

Non-characters are still "permanently reserved for internal use", quoting from Corrigendum #9. I want to emphasize that word "internal". An application should be able to presume that data it receives from an external source does not contain non-characters, so it is free to use them in any way it wishes. This is the whole point of non-characters, to have some code points available for you that you are assured won't be coming from somewhere else.

And how do things come from somewhere else? Through I/O. Hence, the presumption by Perl should be that I/O is related to an external interface. It may be that an application is composed of cooperating processes that communicate via I/O, but Perl's presumption must be, unless indicated otherwise, that I/O is for external interfaces.

An application that uses non-characters will want its inputs to be free of them. It wants them filtered out; the best choice is to have them turned into REPLACEMENT CHARACTERS. My claim is that Perl should do this by default. Corrigendum #9 doesn't change this. And there should be a way to change the default. That is what Corrigendum #9 makes clear, and which Perl already does in (too) many cases. That Corrigendum was aimed not at Perl, but at other Unicode implementations. My point is that Perl already implements this Corrigendum, and need not, and should not, change because of it.
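
The filtering described here can be sketched at the application level. The example below replaces every non-character in already-decoded text with U+FFFD, using Perl's \p{Noncharacter_Code_Point} Unicode property. The function name scrub_nonchars and the sample input are hypothetical; this illustrates the idea, not what Perl's I/O layers do internally.

```perl
use strict;
use warnings;

# Hypothetical input filter: replace every non-character in decoded text
# with U+FFFD REPLACEMENT CHARACTER, leaving all other code points alone.
sub scrub_nonchars {
    my ($text) = @_;
    ( my $clean = $text ) =~ s/\p{Noncharacter_Code_Point}/\x{FFFD}/g;
    return $clean;
}

my $external = "user\x{FFFF}name";    # pretend this arrived from outside
my $clean    = scrub_nonchars($external);

# Show the code points rather than printing the characters themselves.
print join( ' ', map { sprintf 'U+%04X', ord($_) } split //, $clean ), "\n";
```

Private-use code points such as U+E000 pass through this filter unchanged, which matches the distinction drawn between the two classes in this message.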

We agreed long ago that the default input for Perl should be strict, and that explicit action should be taken to override that. Strict input should continue to exclude non-characters. If we were to change that, existing applications would be suddenly and silently exposed to security holes, where an attacker who knows the internal structure of the application inserts non-characters to fool it.

Let me reiterate my main point. We already implement Corrigendum #9. We should not make changes because of it.

Private-use characters are not the same as non-characters. An application has no right to presume that external inputs don't include private-use characters. But it is free to ascribe its own meanings to them. In practice, most applications will just treat them as some generic code points.

I think David Golden's ideas would be a useful addition, but it's not my itch. I would be happy to consult with someone who wishes to scratch it, though.
--
Karl Williamson

---
via perlbug: queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=121226

Jarkko Hietaniemi

May 29, 2014, 3:49:49 PM

to perlbug-...@perl.org, perl5-...@perl.org

There's input and there's output.

I agree that default input should be strict: but I think stricter than
what we have now, e.g. not accept U+200000. And not accept non-chars.

(There's also more spectrum than just spewing warnings: currently we
generate U+FFFD but then *continue* reading. We could e.g. truncate
and stop reading, and/or croak...)
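
For explicit decode() calls, Encode's documented CHECK argument already offers part of this spectrum. The sketch below shows three of the modes on a malformed byte (\xFF) rather than on a non-character, because how strict decoding should treat non-characters is the very point under discussion. Variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $input = "ab\xFFcd";    # \xFF never appears in well-formed UTF-8

# Default (FB_DEFAULT): substitute U+FFFD and keep going.
my $replaced = decode( "UTF-8", $input );

# FB_QUIET: stop at the first problem, return what was decoded so far,
# and leave the unprocessed bytes in the (modified) source buffer.
my $buf     = $input;
my $decoded = decode( "UTF-8", $buf, Encode::FB_QUIET );

# FB_CROAK: die at the first problem.
my $copy = $input;
my $ok   = eval { decode( "UTF-8", $copy, Encode::FB_CROAK ); 1 };
my $err  = $ok ? "" : $@;
chomp $err;

printf "default: %d characters\n", length $replaced;
printf "quiet:   %d characters decoded, %d bytes left over\n",
    length $decoded, length $buf;
print "croak:   ", ( $ok ? "no error" : "died: $err" ), "\n";
```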

I am not entirely certain about the definition of "internal" here,
though. Internal to what? One "process"? What if Perl is just a
"library" and not an "application"? A set of Perl applications? A set
of mixed applications?

But on output if I output U+FFFF I don't want to output U+FFFD. (This
doesn't happen now, either: we just warn. But being strict the wrong
way, this could happen.) This is no different from chr(0xFFFF), really,
if I write that I don't want magic making it chr(0xFFFD).

Again, quoting the C9: "However, they are not illegal in interchange nor
do they cause ill-formed Unicode text. This has always been the intent
of the standard, as expressed by the Unicode Technical Committee." So
us currently warning non-chars being illegal for interchange is wrong.
They are not.
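
A sketch of the output side: the warning belongs to the lexical nonchar warnings category, so a program that deliberately writes non-characters can switch it off for just that code. The example writes to an in-memory buffer through a :utf8 layer as a stand-in for STDOUT under -CO, then shows which bytes were written and how many warnings were caught. The buffer and handler are illustrative assumptions, not part of the ticket.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# In-memory file with a :utf8 layer, standing in for STDOUT under -CO.
my $out = '';
open( my $fh, '>:utf8', \$out ) or die "cannot open in-memory handle: $!";
{
    no warnings 'nonchar';    # opt out of the non-character warning here only
    print {$fh} chr(0xFFFF);
}
close $fh;

printf "bytes written: %s\n",
    join( ' ', map { sprintf '%02X', ord($_) } split //, $out );
printf "warnings seen: %d\n", scalar @warnings;
```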

Jarkko Hietaniemi

May 29, 2014, 4:07:20 PM

to perlbug-...@perl.org, perl5-...@perl.org

On Thursday-201405-29, 15:49, Jarkko Hietaniemi wrote:
> This is no different from chr(0xFFFF), really,
> if I write that I don't want magic making it chr(0xFFFD).

Or this:

perl -MEncode=decode -MDevel::Peek -we 'Dump(decode("UTF-8",
"\xEF\xBF\xBF"))'

giving me the bytes \xEF\xBF\xBD, aka U+FFFD.

Karl Williamson

May 29, 2014, 10:47:17 PM

to j...@iki.fi, perlbug-...@perl.org, perl5-...@perl.org

On 05/29/2014 01:49 PM, Jarkko Hietaniemi wrote:
> There's input and there's output.
>
> I agree that default input should be strict: but I think stricter than
> what we have now, e.g. not accept U+200000. And not accept non-chars.
>

This has been hashed around a lot before, and I think everyone now
agrees with you here.

> (There's also more spectrum than just spewing warnings: currently we
> generate U+FFFD but then *continue* reading. We could e.g. truncate
> and stop reading, and/or croak...)

Perhaps options.

>
> I am not entirely certain about the definition of "internal" here,
> though. Internal to what? One "process"? What if Perl is just a
> "library" and not an "application"? A set of Perl applications? A set
> of mixed applications?

That's why there has to be flexibility. We have to make the default the
sanest and safest, but allow the programmer(s) to override it for their
needs.

>
> But on output if I output U+FFFF I don't want to output U+FFFD. (This
> doesn't happen now, either: we just warn. But being strict the wrong
> way, this could happen.) This is no different from chr(0xFFFF), really,
> if I write that I don't want magic making it chr(0xFFFD).

Agreed. The reason we warn is so you know you're outputting something
somebody else likely won't be able to handle. The only time something
should be translated into FFFD is on input. I don't know about ENV or ARGV.

>
> Again, quoting the C9: "However, they are not illegal in interchange nor
> do they cause ill-formed Unicode text. This has always been the intent
> of the standard, as expressed by the Unicode Technical Committee." So
> us currently warning non-chars being illegal for interchange is wrong.
> They are not.

The wording should change, but I do believe there should be a warning
nonetheless. I do wish Unicode had phrased the original and the
Corrigendum better. They do seem to me to have an aversion to
straightforward language.