# New Ticket Created by Jarkko Hietaniemi
# Please include the string: [perl #121937]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/Ticket/Display.html?id=121937 >
It seems that Perl is lagging in its handling of Unicode
"non-characters" [1]: these days they are valid for interchange:
http://www.unicode.org/versions/corrigendum9.html
In other words, they should be handled much like PUA (private use area)
characters [2]: passed through as-is.
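
For reference, here is a minimal sketch (in Python, not Perl, and not taken from the Perl or Encode sources) of which code points are meant by "non-characters": the 32 code points U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, 66 in total. The helper name is_noncharacter is made up for this illustration. The round trip at the end shows what "passed through as-is" looks like with a strict UTF-8 codec that accepts them.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters.

    Hypothetical helper for illustration only; not a Perl or Encode API.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF: the last two code points of every plane.
    return (cp & 0xFFFE) == 0xFFFE


nonchars = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(nonchars))  # 66

# "Passed through as-is": encoding and strictly decoding leaves every
# noncharacter unchanged (Python's UTF-8 codec does not reject them).
for cp in nonchars:
    assert chr(cp).encode("utf-8").decode("utf-8") == chr(cp)

print(chr(0xFFFF).encode("utf-8"))  # b'\xef\xbf\xbf'
```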
What we currently get wrong:
(a) ./perl -CO -we 'print chr(0xFFFF)'
Unicode non-character U+FFFF is illegal for open interchange at -e line 1.
�%
(Somewhat strangely, the -CO is required for the warning to appear.)
We shouldn't warn.
It is possible we still could warn somehow, to alert users about the
special nature of "non-characters" (a very unfortunate name), but they
are definitely legal characters, and they can be interchanged. (They
are not *intended* for interchange, but that is quite different from
"forbidden".)
(b) In Encode, the lax "utf8" encoding lets the non-chars through, but the
strict "UTF-8" encoding mangles them to U+FFFD REPLACEMENT CHARACTER:
./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("utf8", "\xEF\xBF\xBF"))'
SV = PV(0x7ffba18041f0) at 0x7ffba1803438
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x7ffba143e6c0 "\357\277\277"\0 [UTF8 "\x{ffff}"]
CUR = 3
LEN = 16
./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("UTF-8", "\xEF\xBF\xBF"))'
SV = PV(0x7ff34104aa50) at 0x7ff341031f28
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x7ff340d022e0 "\357\277\275"\0 [UTF8 "\x{fffd}"]
CUR = 3
LEN = 16
We shouldn't mangle.
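
To make the two dumps above easier to compare, here is a small worked sketch (again Python, with a made-up helper name decode_three_byte) of the bit arithmetic for a three-byte UTF-8 sequence. It shows that the input bytes EF BF BF (octal \357\277\277) are a well-formed encoding of U+FFFF, and that the bytes in the second dump, EF BF BD (octal \357\277\275), are the encoding of U+FFFD. It is only an illustration of the arithmetic, not how Encode decodes internally.

```python
def decode_three_byte(seq: bytes) -> int:
    """Decode one three-byte UTF-8 sequence to a code point.

    Hypothetical helper for illustration; it handles only the
    1110xxxx 10xxxxxx 10xxxxxx form.
    """
    if len(seq) != 3:
        raise ValueError("expected exactly three bytes")
    lead, c1, c2 = seq
    if (lead & 0xF0) != 0xE0:
        raise ValueError("not a three-byte lead byte")
    if (c1 & 0xC0) != 0x80 or (c2 & 0xC0) != 0x80:
        raise ValueError("bad continuation byte")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    cp = ((lead & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F)
    # Reject overlong forms and surrogates; noncharacters are NOT rejected.
    if cp < 0x800 or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("ill-formed sequence")
    return cp


print(hex(decode_three_byte(b"\xEF\xBF\xBF")))  # 0xffff
print(hex(decode_three_byte(b"\xEF\xBF\xBD")))  # 0xfffd
```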
---
[1] http://www.unicode.org/faq/private_use.html#nonchar1
[2] http://www.unicode.org/faq/private_use.html