[pugs] regexp "bug"?

BÁRTHÁZI András

Apr 15, 2005, 2:27:38 AM
to perl6-c...@perl.org
Hi,

This code:

my $a='A';
$a ~~ s:perl5:g/A/{chr(65535)}/;
say $a.bytes;

Outputs "0". Why?

Bye,
Andras

BÁRTHÁZI András

Apr 15, 2005, 3:16:58 AM
to perl6-c...@perl.org
Hi,

>> This code:
>>
>> my $a='A';
>> $a ~~ s:perl5:g/A/{chr(65535)}/;
>> say $a.bytes;
>>
>> Outputs "0". Why?
>
> \uFFFF is not a legal Unicode codepoint. chr(65535) should raise an
> exception of some type. So the above code does seem to show a possible
> bug. But as chr(65535) is an undefined char, who knows what the
> code is actually doing.


In my opinion (which may be wrong), \uFFFF can be stored as a UTF-8
character; it should be 0xEF~0xBF~0xBF. If I do it outside the regexp (I
mean "say chr(65535).bytes"), it works well.
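
The byte sequence claimed above can be checked by hand. The following is a minimal sketch in Python (not Pugs code); the function name `utf8_three_byte` is made up for this illustration, and it only covers the three-byte range of UTF-8. It does not reject surrogates or noncharacters, so it shows only that U+FFFF can be stored in UTF-8, not whether it should be.

```python
# Illustration only: Python, not Pugs. Shows how a code point in the
# range U+0800..U+FFFF maps onto three UTF-8 bytes.

def utf8_three_byte(code_point: int) -> bytes:
    """Hand-encode a code point in U+0800..U+FFFF as three UTF-8 bytes.

    Hypothetical helper: it performs no legality checks beyond the range.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte range is handled here")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


encoded = utf8_three_byte(0xFFFF)
print(" ".join(f"{b:02x}" for b in encoded))  # ef bf bf
print(len(encoded))                           # 3

# Python's own encoder agrees for this code point.
print(chr(0xFFFF).encode("utf-8") == encoded)  # True
```

So a byte count of 3 is what one would expect from `.bytes` if the replacement string really held chr(65535).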

Another "bug" I've found is not related to regexps, but it is still a
Unicode character issue:

say chr(0x10FFFF).bytes;

The answer:

pugs: encodeUTF8: ord returned a value above 0x10FFFF

And if I start to increment $b, I will get:

pugs: Prelude.chr: bad argument

I don't understand it, as I thought that Unicode characters are in the range
0x00000000-0x7FFFFFFF. Does Haskell not support the whole set?

There is a Unicode encoding, called UCS-2, that covers only
0x0000-0xFFFF, but that still doesn't answer the question.
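
For comparison, the Unicode code space as currently defined ends at U+10FFFF, and Haskell's `Char` has the same upper bound, which would be consistent with the "Prelude.chr: bad argument" message for larger values. A rough Python sketch of the same boundary (not Pugs or Haskell code; `try_chr` and `MAX_CODE_POINT` are names invented for this example):

```python
import sys

# Illustration only: Python, not Pugs or Haskell.
MAX_CODE_POINT = 0x10FFFF  # top of the Unicode code space


def try_chr(value: int) -> str:
    """Describe what Python's chr() does with `value` (hypothetical helper)."""
    try:
        ch = chr(value)
    except ValueError:
        return "rejected"
    # 'surrogatepass' lets the byte count be computed for any accepted value.
    size = len(ch.encode("utf-8", "surrogatepass"))
    return f"accepted, {size} UTF-8 bytes"


print(sys.maxunicode == MAX_CODE_POINT)  # True
print(try_chr(0x10FFFF))                 # accepted, 4 UTF-8 bytes
print(try_chr(0x110000))                 # rejected
```

Under that reading, chr(0x10FFFF) itself is still inside the range, so the encodeUTF8 message above looks like an off-by-one in the bounds check rather than a limit of the character type; that is a guess from the error text, not something confirmed in this thread.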

[...]

Meanwhile, I've found this:
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2175.htm

It may be the answer to my question.

Bye,
Andras


BÁRTHÁZI András

Apr 15, 2005, 4:04:32 AM
to Mark A. Biggar, perl6-c...@perl.org
Hi,

> Yes, the value 0xFFFF can be stored as either a 3-byte UTF-8 string or a
> 2-byte UCS-2 value, but the Unicode standard specifically says that the
> values 0xFFFF, 0xFFFE and 0xFEFF are NOT valid codepoints and should
> never appear in a Unicode string. 0xFFFF is reserved for out-of-band
> signaling (such as the -1 returned by getc()) and 0xFFFE and 0xFEFF are
> specifically reserved for out-of-band marking of a UCS-2 file as being
> either big-endian or little-endian, but are specifically not considered
> part of the data. chr() is currently defined to mean "convert an int
> value to a Unicode codepoint". That's why I said that chr(65535) should
> raise an exception; it's an argument error similar to sqrt(-1).

Thanks, I didn't know that. I thought they just don't appear in UTF-8
encoded strings, but you're right. I'd recommend that it raise an exception, too.

Bye,
Andras

Mark A. Biggar

Apr 15, 2005, 2:35:23 AM
to BÁRTHÁZI András, perl6-c...@perl.org
BÁRTHÁZI András wrote:

\uFFFF is not a legal Unicode codepoint. chr(65535) should raise an
exception of some type. So the above code does seem to show a possible
bug. But as chr(65535) is an undefined char, who knows what the
code is actually doing.

--
ma...@biggar.org
mark.a...@comcast.net

BÁRTHÁZI András

Apr 15, 2005, 4:42:33 AM
to Mark A. Biggar, perl6-c...@perl.org
Hi,

>> my $a='A';
>> $a ~~ s:perl5:g/A/{chr(65535)}/;
>> say $a.bytes;
>>
>> Outputs "0". Why?
>

> \uFFFF is not a legal Unicode codepoint. chr(65535) should raise an
> exception of some type. So the above code does seem to show a possible
> bug. But as chr(65535) is an undefined char, who knows what the
> code is actually doing.

It seems that it gives back 0 in the 0xE000-0xFFFF range. Do you still
think that's normal?

"Some Unicode code points are invalid and should not be used. [...] It
can't be 0xFFFF or 0xFFFE, it can't be both <= 0xDFFF and >= 0xD800, and
it can't be > 0x10FFFF and it can't be less than 0."

http://www.elfdata.com/plugin/unicodefaqdata.html
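
The quoted rule can be written down as a predicate. This is a minimal Python sketch (not Pugs code) that applies only the conditions in the quote; the name `is_valid_code_point` is made up here. The Unicode standard lists further noncharacters that the quote does not mention, and this sketch deliberately ignores them.

```python
# Illustration only: Python, not Pugs. Encodes just the quoted conditions.

def is_valid_code_point(value: int) -> bool:
    """Apply the rules from the quoted FAQ text, and nothing more."""
    if value < 0 or value > 0x10FFFF:
        return False   # outside the code space
    if 0xD800 <= value <= 0xDFFF:
        return False   # surrogate range
    if value in (0xFFFE, 0xFFFF):
        return False   # the two values named in the quote
    return True


for cp in (0x41, 0xE000, 0xD800, 0xFFFE, 0xFFFF, 0x10FFFF, 0x110000):
    print(f"U+{cp:04X}: {is_valid_code_point(cp)}")
# U+0041: True
# U+E000: True
# U+D800: False
# U+FFFE: False
# U+FFFF: False
# U+10FFFF: True
# U+110000: False
```

By these rules U+E000 is acceptable, so a result of 0 for everything from 0xE000 upward cannot be explained by the two excluded values alone.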

Bye,
Andras

Mark A. Biggar

Apr 15, 2005, 3:56:14 AM
to BÁRTHÁZI András, perl6-c...@perl.org
BÁRTHÁZI András wrote:

Yes, the value 0xFFFF can be stored as either a 3-byte UTF-8 string or a
2-byte UCS-2 value, but the Unicode standard specifically says that the
values 0xFFFF, 0xFFFE and 0xFEFF are NOT valid codepoints and should
never appear in a Unicode string. 0xFFFF is reserved for out-of-band
signaling (such as the -1 returned by getc()) and 0xFFFE and 0xFEFF are
specifically reserved for out-of-band marking of a UCS-2 file as being
either big-endian or little-endian, but are specifically not considered
part of the data. chr() is currently defined to mean "convert an int
value to a Unicode codepoint". That's why I said that chr(65535) should
raise an exception; it's an argument error similar to sqrt(-1).

--
ma...@biggar.org
mark.a...@comcast.net

h...@crypt.org

Apr 15, 2005, 8:27:34 AM
to Mark A. Biggar, and...@barthazi.hu, perl6-c...@perl.org
"Mark A. Biggar" <ma...@floorboard.com> wrote:

In perl5 at least, we support a wider concept of codepoints than the
Unicode consortium. This allows us to use strings for a wider variety
of things than just Unicode text (e.g. version strings, bit vectors, etc.).

In perl6 the greatly expanded set of types will presumably allow us
to distinguish actual Unicode data from more arbitrary sequences of
codepoints, and I'd normally expect that the more constrained type
would be a subtype of the less constrained type. In this case that
means I'd expect "Unicode string" to be a subtype of something like
"codepoint sequence".

(In fact it'd probably be useful to have more levels than that - there
are times when you need the Unicode concepts for things like [[:digit:]],
but may be able to get better performance by avoiding the checks for
'legal Unicode codepoint'.)

On the other hand you will probably be able to achieve the things p5
overloads onto strings using packed integer arrays, so maybe this all
represents unnecessary complications. In which case maybe 'relaxed'
variants of Unicode strings aren't needed. We will probably still want
other sorts of strings though, such as ASCII.

Hugo

Larry Wall

Apr 15, 2005, 12:34:58 PM
to perl6-c...@perl.org
On Fri, Apr 15, 2005 at 12:56:14AM -0700, Mark A. Biggar wrote:
: Yes, the value 0xFFFF can be stored as either a 3-byte UTF-8 string or a
: 2-byte UCS-2 value, but the Unicode standard specifically says that the
: values 0xFFFF, 0xFFFE and 0xFEFF are NOT valid codepoints and should
: never appear in a Unicode string. 0xFFFF is reserved for out-of-band
: signaling (such as the -1 returned by getc()) and 0xFFFE and 0xFEFF are
: specifically reserved for out-of-band marking of a UCS-2 file as being
: either big-endian or little-endian, but are specifically not considered
: part of the data. chr() is currently defined to mean "convert an int
: value to a Unicode codepoint". That's why I said that chr(65535) should
: raise an exception; it's an argument error similar to sqrt(-1).

It has to at least be possible to Think Bad Thoughts in Perl.
It doesn't have to be the default, though. But there has to be
some way of allowing illegal characters to be talked about, or
you can't write programs that talk about them. It's like saying
it's okay to be an executioner as long as you don't kill anyone...

Larry

Mark A Biggar

Apr 15, 2005, 1:12:54 PM
to Larry Wall, perl6-c...@perl.org

Isn't that what the difference between byte-level and codepoint-level
access to strings is all about? If you want to work with values that
are illegal codepoints, then you should be working at the byte level,
not the codepoint level, at least by default.

--
Mark Biggar
ma...@biggar.org
mark.a...@comcast.net
mbi...@paypal.com

Larry Wall

Apr 15, 2005, 2:55:53 PM
to perl6-c...@perl.org
On Fri, Apr 15, 2005 at 05:12:54PM +0000, mark.a...@comcast.net wrote:

: Isn't that what the difference between byte-level and codepoint-level
: access to strings is all about? If you want to work with values that
: are illegal codepoints then you should be working at the byte-level
: not the codepoint-level, at least by default.

Sure, but there's no guarantee you have access to a lower level,
depending on the interface presented by the object in question, and
you probably shouldn't have to know that anyway, if there's a useful
abstraction level at which "illegal character" means something as
a unit to the higher level. The fact is that U+FFFF is an illegal
character regardless of the encoding, and I'd like to be able to
talk about it as a character, without having to know whether it's
an illegal UTF-8 byte sequence, or an illegal UTF-16 byte sequence,
or a 256-bit integer stored somewhere that you just aren't allowed
to think about certain values of.

In short, "legal" Unicode strings should probably be viewed as a
constrained subtype of strings, not as a storage type. I know you've
known Ada from its infancy. :-) Perl 6 makes the same distinction, and
can presumably get at the unconstrained type for any constrained type.
So if you hand me a Unicode string with arbitrary value restrictions,
there had better be a way to view that string without the arbitrary
restrictions. You need to be able to determine somehow that types
Even or Odd have a storage class of type Int.
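
The subtype idea can be sketched outside Perl 6. The following Python sketch (not Perl 6 or Pugs code) is only an analogy: the class names `CodepointSeq` and `UnicodeText`, the `unconstrained` method, and the `is_legal` check are all invented for this illustration, and the legality rules are the ones quoted earlier in the thread, not a full statement of the Unicode standard.

```python
from typing import Iterable, Tuple

# Illustration only: Python, not Perl 6. All names here are hypothetical.


def is_legal(value: int) -> bool:
    """Legality rules as quoted earlier in the thread (an assumption)."""
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        return False
    return value not in (0xFFFE, 0xFFFF)


class CodepointSeq:
    """Unconstrained type: any sequence of non-negative integers."""

    def __init__(self, values: Iterable[int]) -> None:
        self.values: Tuple[int, ...] = tuple(values)
        if any(v < 0 for v in self.values):
            raise ValueError("code points must be non-negative")


class UnicodeText(CodepointSeq):
    """Constrained subtype: same storage, plus a legality check."""

    def __init__(self, values: Iterable[int]) -> None:
        super().__init__(values)
        bad = [v for v in self.values if not is_legal(v)]
        if bad:
            raise ValueError(f"illegal code points: {[hex(v) for v in bad]}")

    def unconstrained(self) -> CodepointSeq:
        """View the same data without the subtype's restrictions."""
        return CodepointSeq(self.values)


# The unconstrained type can talk about U+FFFF as a unit ...
raw = CodepointSeq([0x41, 0xFFFF])
print(raw.values)  # (65, 65535)

# ... while the constrained subtype refuses it.
try:
    UnicodeText([0x41, 0xFFFF])
except ValueError as err:
    print("rejected:", err)

text = UnicodeText([0x41, 0x42])
print(isinstance(text, CodepointSeq))          # True
print(type(text.unconstrained()).__name__)     # CodepointSeq
```

The point of the design is that the restriction lives in the subtype, not in the storage, so a program that needs to discuss an illegal character can do so at the character level without dropping down to bytes.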

Larry

Nicholas Clark

Apr 15, 2005, 3:47:29 PM
to perl6-c...@perl.org
On Fri, Apr 15, 2005 at 09:34:58AM -0700, Larry Wall wrote:

> It doesn't have to be the default, though. But there has to be
> some way of allowing illegal characters to be talked about, or
> you can't write programs that talk about them. It's like saying

Thoughtcrime acceptable. Doubleplusgood.

Nicholas Clark
