question regarding rules and bytes vs characters

Ph. Marek

Jun 1, 2004, 1:56:41 AM

to perl6-l...@perl.org

Hello everybody,

I'm about to teach myself perl6 (after using perl5 for some time).

One of my first questions deals with regexes.

I'd like to parse data of the form
Len: 15\n
(15 bytes data)\n
Len: 5\n
(5 bytes data)\n
\n
OtherTag: some value here\n
and so on, where the data can (and will) be binary.

I'd try for something like
my $data_tag = rule {
    Len\: $len:=(\d) \n
    $data:=([:u0 .]<$len>)\n   # these are bytes
};

Is that correct?

Furthermore, perl6 is said to be Unicode-ready.
So I put the :u0-modifier in the data-regex; will that DWIM if I try to match
a unicode-string with that rule?

Is anything known about the internals of pattern matching whether the
hypothetical variables will consume (double) space?
I'm asking because I imagine getting a tag like "Len: 200000000" and then
having problems with 256MB RAM. Matching shouldn't be a problem according to
apo 5 (see the chapter "RFC 093: Regex: Support for incremental pattern
matching") but I'll maybe have troubles using the matched data?

Thank you for all answers!

Regards,

Phil
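
The record format described above can be tried out today in a language other than Perl 6. What follows is only a minimal Python sketch of the same idea (read the length from the header, then take exactly that many bytes without interpreting them), not the rule syntax discussed in this thread. The names `HEADER`, `parse_records` and `sample` are made up for illustration, and the pattern uses `\d+` so that multi-digit lengths such as 15 are accepted.

```python
import re

# Header pattern: "Len: <digits>\n". It is compiled from a bytes literal,
# so it matches raw bytes rather than decoded text.
HEADER = re.compile(rb"Len: (\d+)\n")


def parse_records(buf: bytes) -> list[bytes]:
    """Return the payloads of the consecutive 'Len: N' records at the start of buf."""
    payloads = []
    pos = 0
    while True:
        m = HEADER.match(buf, pos)
        if m is None:
            break  # no further Len: record (e.g. the blank line before OtherTag)
        length = int(m.group(1))
        start = m.end()
        end = start + length
        # The payload is taken by length only; it must be followed by "\n".
        if buf[end:end + 1] != b"\n":
            raise ValueError(f"record at offset {pos} is truncated or malformed")
        payloads.append(buf[start:end])
        pos = end + 1
    return payloads


# The 15-byte payload deliberately contains a newline byte (0x0a),
# and so does the 5-byte one.
sample = (
    b"Len: 15\n" + bytes(range(15)) + b"\n"
    b"Len: 5\n" + b"ab\ncd" + b"\n"
    b"\n"
    b"OtherTag: some value here\n"
)
records = parse_records(sample)
```

Because the payload is sliced by length instead of being matched character by character, newline bytes inside the binary data do not confuse the parser.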

Larry Wall

Jul 9, 2004, 4:20:41 PM

to Ph. Marek, perl6-l...@perl.org

On Tue, Jun 01, 2004 at 07:56:41AM +0200, Ph. Marek wrote:
: Hello everybody,
:
: I'm about to teach myself perl6 (after using perl5 for some time).

I'm also trying to learn perl6 after using perl5 for some time. :-)

: One of my first questions deals with regexes.
:
: I'd like to parse data of the form
: Len: 15\n
: (15 bytes data)\n
: Len: 5\n
: (5 bytes data)\n
: \n
: OtherTag: some value here\n
: and so on, where the data can (and will) be binary.
:
: I'd try for something like
: my $data_tag= rule {
: Len\: $len:=(\d) \n
: $data:=([:u0 .]<$len>)\n # these are bytes
: };
:
: Is that correct?

Pretty close. The way it's set up currently, $len is a reference
to a variable external to the rule, so $len is likely to fail under
stricture unless you've declared "my $len" somewhere. To make the
variable automatically scope to the rule, you have to use $?len
these days.

: Furthermore, perl6 is said to be Unicode-ready.
: So I put the :u0-modifier in the data-regex; will that DWIM if I try to match
: a unicode-string with that rule?

It should. However (and this is a really big however), you'll have
to be very careful that something earlier hasn't converted one form
of Unicode to another on you. For instance, if your string came in
as UTF-8, and your I/O layer translated it internally to UTF-32 or
some such, you're just completely hosed. When you're working at the
bytes level, you must know the encoding of your string.

So the natural reaction is to open your I/O handle :raw to get binary
data into your string. Then you try to match Unicode graphemes with [
:u2 . ] and discover that *that* doesn't work. Which is obvious when
you consider that Perl has no way of knowing which Unicode encoding
the binary data is in, so it's gonna consider it to be something like
Latin-1 unless you tell it otherwise. So you'll probably have to
cast the binary string to whatever its actual encoding is (potentially
lying about the binary parts, which we may or may not get away with,
depending on who validates the string when), or maybe we just need
to define rules like <utf16be_codepoint> and <utf8_grapheme> for use
under the :u0 regime.
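
The hazard described above is easy to reproduce outside Perl 6. The following is a small Python sketch, not anything from Parrot or the Perl 6 design; the variable names are invented for illustration. It shows that the same text has a different byte length in each encoding, and that raw bytes treated as Latin-1 yield one "character" per byte instead of the character that was meant.

```python
text = "Len: 5"

# The same six code points in three Unicode encodings.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")
byte_lengths = (len(utf8), len(utf16), len(utf32))

# A byte count taken from a "Len:" tag is only meaningful for the encoding
# the data actually arrived in; after a silent conversion it is wrong.
assert byte_lengths == (6, 12, 24)

# Raw UTF-8 bytes for U+00E9 (e with acute accent).
raw = "\u00e9".encode("utf-8")       # two bytes: 0xC3 0xA9

# Treated as Latin-1, every byte becomes its own character.
as_latin1 = raw.decode("latin-1")    # two characters, not the intended one

# Told the real encoding, the decoder recovers the single code point.
as_utf8 = raw.decode("utf-8")        # one character
```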

: Is anything known about the internals of pattern matching whether the
: hypothetical variables will consume (double) space?
: I'm asking because I imagine getting a tag like "Len: 200000000" and then
: having problems with 256MB RAM. Matching shouldn't be a problem according to
: apo 5 (see the chapter "RFC 093: Regex: Support for incremental pattern
: matching") but I'll maybe have troubles using the matched data?

My understanding is that Parrot implements copy-on-write, so you should
be okay there.
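
As an illustration of the memory concern only: Python does not use copy-on-write for byte strings, and slicing a `bytes` object copies, but its `memoryview` type gives a comparable effect by referring to a range of an existing buffer without duplicating it. The sketch below assumes the whole input is already in memory; the names and the one-megabyte size are arbitrary.

```python
payload_len = 1_000_000
header = b"Len: 1000000\n"

# One buffer holding header, payload and trailing newline.
buf = bytearray(header) + bytearray(payload_len) + bytearray(b"\n")

# Slicing a memoryview does not copy the underlying bytes.
view = memoryview(buf)
data = view[len(header):len(header) + payload_len]

# The slice shares storage with buf: a write to buf is visible through it.
buf[len(header)] = 0xFF
first_byte = data[0]

# A second copy of the payload exists only once it is requested explicitly.
copied = bytes(data)
```

With this approach the "captured" payload costs no extra memory until something asks for an independent copy, which is the behaviour the question was hoping for.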

: Thank you for all answers!

Even the late ones? :-)

Larry

Ph. Marek

Jul 12, 2004, 1:42:02 AM

to perl6-l...@perl.org, Larry Wall

> : Hello everybody,
> :
> : I'm about to teach myself perl6 (after using perl5 for some time).
>
> I'm also trying to learn perl6 after using perl5 for some time. :-)

I wouldn't even try to compare you and me .... :-)

> Pretty close. The way it's set up currently, $len is a reference
> to a variable external to the rule, so $len is likely to fail under
> stricture unless you've declared "my $len" somewhere. To make the
> variable automatically scope to the rule, you have to use $?len
> these days.

ok.

> : Furthermore, perl6 is said to be Unicode-ready.
> : So I put the :u0-modifier in the data-regex; will that DWIM if I try to
> : match a unicode-string with that rule?
>
> It should. However (and this is a really big however), you'll have
> to be very careful that something earlier hasn't converted one form
> of Unicode to another on you. For instance, if your string came in
> as UTF-8, and your I/O layer translated it internally to UTF-32 or
> some such, you're just completely hosed. When you're working at the
> bytes level, you must know the encoding of your string.
>
> So the natural reaction is to open your I/O handle :raw to get binary
> data into your string. Then you try to match Unicode graphemes with [
> :u2 . ] and discover that *that* doesn't work. Which is obvious when
> you consider that Perl has no way of knowing which Unicode encoding
> the binary data is in, so it's gonna consider it to be something like
> Latin-1 unless you tell it otherwise. So you'll probably have to
> cast the binary string to whatever its actual encoding is (potentially
> lying about the binary parts, which we may or may not get away with,
> depending on who validates the string when), or maybe we just need
> to define rules like <utf16be_codepoint> and <utf8_grapheme> for use
> under the :u0 regime.

Of course the file must be opened in binary mode - else the line-endings etc.
can be destroyed in the binary data, which is bad.

So Perl/Parrot can't autodetect the kind of encoding.
But maybe it should be possible to do something like
[:utf16be_codepoint]? Len: $?len:=(\d+) \n
$?data:=([:raw .]<$len>) \n
ie. say that the conversion to unicode is optional??

> : Is anything known about the internals of pattern matching whether the
> : hypothetical variables will consume (double) space?
> : I'm asking because I imagine getting a tag like "Len: 200000000" and then
> : having problems with 256MB RAM. Matching shouldn't be a problem according
> : to apo 5 (see the chapter "RFC 093: Regex: Support for incremental
> : pattern matching") but I'll maybe have troubles using the matched data?
>
> My understanding is that Parrot implements copy-on-write, so you should
> be okay there.

ok, thank you.

> Even the late ones? :-)

even them - this is the *only* answer I received.

Again:

> : Thank you for all answers!

> Larry
Phil

Larry Wall

Jul 12, 2004, 2:45:48 PM

to perl6-l...@perl.org

On Mon, Jul 12, 2004 at 07:42:02AM +0200, Ph. Marek wrote:
: Of course the file must be opened in binary mode - else the line-endings etc.
: can be destroyed in the binary data, which is bad.
:
: So Perl/Parrot can't autodetect the kind of encoding.
: But maybe it should be possible to do something like
: [:utf16be_codepoint]? Len: $?len:=(\d+) \n
: $?data:=([:raw .]<$len>) \n
: ie. say that the conversion to unicode is optional??

Yes, that's probably better than forcing an official encoding on something
that doesn't have a consistent encoding. Though I don't believe you want
the square brackets there. Something more like:

:utf16be_codepoint Len: $?len:=(\d+) \n
$?data:=(:byte . <$len>) \n

(since the :byte will scope to the capturing parens, and I imagine
$?len ends up being immediately typed as Unicode such that it can be
used as a number even in a :byte context.)

Or if you want the brackets for clarity:

[:utf16be_codepoint
    Len: $?len:=(\d+) \n
    $?data:=([:byte .] <$len>) \n
]
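
The mixed-level idea above (header read as UTF-16BE code points, payload read as raw bytes) can be approximated in Python. This is only a sketch under stated assumptions: `parse_mixed` is a hypothetical helper, and it assumes that the newline after the payload is again UTF-16BE, on the reading that the `\n` sits outside the :byte-scoped capture.

```python
def parse_mixed(buf: bytes) -> tuple[int, bytes]:
    """Parse one record: 'Len: N' + newline in UTF-16BE, N raw bytes, UTF-16BE newline."""
    newline = "\n".encode("utf-16-be")  # two bytes: 0x00 0x0A

    # Scan code unit by code unit, so the newline is only recognised
    # on a two-byte boundary of the header.
    end = next(
        (i for i in range(0, len(buf) - 1, 2) if buf[i:i + 2] == newline),
        None,
    )
    if end is None:
        raise ValueError("no header line found")

    # Header: decoded text, so the digits can be used as a number.
    header = buf[:end].decode("utf-16-be")
    prefix = "Len: "
    digits = header[len(prefix):]
    if not header.startswith(prefix) or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"bad header: {header!r}")
    length = int(digits)

    # Payload: taken by byte count, never decoded.
    start = end + len(newline)
    data = buf[start:start + length]
    tail = buf[start + length:start + length + len(newline)]
    if len(data) != length or tail != newline:
        raise ValueError("payload is truncated or not newline-terminated")
    return length, data


# The payload itself contains the bytes of a UTF-16BE newline; it is still
# returned intact because it is sliced by length rather than searched.
record = "Len: 3\n".encode("utf-16-be") + b"\x00\n\xff" + "\n".encode("utf-16-be")
length, data = parse_mixed(record)
```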

Probably :utf16be_codepoint wants to be written :utf16be:codepoint
or some such, since the encoding and the unicode abstraction level
are (mostly) orthogonal. Or maybe it's :code("utf16be"). Or even
better, maybe the encoding is an optional named parameter, as in
:code(:utf16be), where :code by itself defaults to :code(:utf8). That
extends nicely to things like :graph(:utf32) and :lang("de",:scsu),
where :lang requires the language to be specified, but can default
the encoding to something reasonable.

Hmm, maybe that means that language-dependent graphemes are called
"langs", which I suppose is short for "langemes".

I suppose that :byte could also take an argument to force a particular
old-style (single-byte) locale, if we choose to support them, and are
willing to take the consequences of Jarkko going postal. :-)

Larry

Austin Hastings

Jul 12, 2004, 3:40:45 PM

to Larry Wall, perl6-l...@perl.org

--- Larry Wall <la...@wall.org> wrote:
>
> Hmm, maybe that means that language-dependent graphemes are called
> "langs", which I suppose is short for "langemes".

Dangerously close to "legumes", there. Perhaps we could refer to
entities matched by regexes as "peas"...

=Austin