To that end, I'd like to suggest (for 6.1 or whatever comes after
initial stability) an extension to rules:
[ 'orange', 'apple', 'apple', 'pear', 'banana' ] ~~ rx :data { 'apple'+ 'pear' }
Adding :data forces the match to proceed against the elements of the
supplied array as a sequence, rather than matching each element
individually the way it behaves now. Each atom in the regex matches one
element of the array in turn.
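To make the intended difference concrete, here is a small sketch (the :data
adverb is only a proposal and isn't implemented anywhere; the first match
shows the current element-by-element behavior described above, the second
the proposed sequence behavior):

    my @input = 'orange', 'apple', 'apple', 'pear', 'banana';

    # Today: the regex is tried against elements individually, so this
    # succeeds because at least one element matches 'apple'.
    @input ~~ rx { apple };

    # Proposed: :data walks the array as a sequence, so this matches the
    # sub-sequence 'apple', 'apple', 'pear'.
    @input ~~ rx :data { 'apple'+ 'pear' };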
To support complex data (instead of matching all elements as fixed
strings), a new matching syntax is proposed. The "current object" in
what follows is the object in the input data which is currently being
treated as an atom (e.g. an array element). It might be any kind of
data such as a sub-array, number or string.
<^...> matches against embedded, complex data. There are several forms
depending on what comes after the ^:
Forms that work on the current element of the input:

  ^{...}   smart-matches the current object against the return value of
           the closure
  ^~exp    parses exp as a regex and matches it as a string against the
           current object (disabling :data locally)
  ^::exp   exp is an identifier; smart-matches the current object against
           that type

Note that the second two forms can be implemented (though possibly not
optimally) in terms of the first.

These forms treat the current element of the input as a sub-array and
attempt to traverse it, leaving :data enabled:

  ^[exp]   parses exp as a regex and matches it against an array object
  ^ name   (note the space) identical to <^[<name>]>
Example:
This parses a binary operator tree:
token undef { <^{undef}> }
token op    { < + - * / > }   # works because the whole object is a one-character string
token term  { <^::Num> | <^~ \d+ > | <undef> }   # a number, a string of digits, or undef
rule binoptree {
    <op>
    $<left>  = [ <term> | <^ binoptree> ]
    $<right> = [ <term> | <^ binoptree> ]
}

[ '+', 5, [ '*', 6, 7 ] ] ~~ rx :data /<binoptree>/
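For what it's worth, here is roughly the Match structure that rule is
intended to produce for the input above; this is just my reading of the
capture rules sketched so far, not output from any implementation:

    # $<binoptree><op>           the '+' element
    # $<binoptree><left><term>   the 5
    # $<binoptree><right>        the match over the nested [ '*', 6, 7 ] sub-array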
Some notes: perhaps this should simply refer to iterable objects and
not arrays? Is there a better way to unify the handling of matching
against the current object vs matching against embedded structures?
What about matching nested hashes?
What I find phenomenal is that this requires so little change to the
existing spec for rules. It's a really simple approach, but it gives us
the ability to start applying rules in all sorts of ways we never
dreamed of before.
I might even tackle trying to implement this instead of the parser
library I was working on if there's some agreement that it makes sense
and looks like the correct way to go about it....
Did you read
http://perlcabal.org/syn/S05.html#Matching_against_non-strings already?
It's not yet implemented by any compiler, but seems to (mostly) do what
you want already, no?
Cheers,
Moritz
On Wed, Oct 7, 2009 at 6:50 PM, Moritz Lenz <mor...@faui2k3.org> wrote:
> Aaron Sherman wrote:
>> One of the first things that's becoming obvious to me in playing with
>> Rakudo's rules is that parsing strings isn't always what I'm going to
>> want to do. The most common example of wanting to parse data that's
>> not in string form is the YACC scenario where you want to have a
>> function produce a stream of tokenized data that is then parsed into a
>> more complex representation. In similar fashion there's transformers
>> like TGE that take syntax trees and transform them into alternative
>> representations.
>>
>> To that end, I'd like to suggest (for 6.1 or whatever comes after
>> initial stability) an extension to rules:
>
> Did you read
> http://perlcabal.org/syn/S05.html#Matching_against_non-strings already?
(I went off and read that, and then replied to Moritz):
OK, no. That proposal only does part of the work. It would suffice for
something like the lexer/parser connectivity, but transforming complex
data structures would be impossible. By contrast, what I've suggested
would work for both cases. It also preserves the existing <a b c> ~~
/b/ functionality that we have today, and it's not entirely clear to
me that the proposal that you linked to does.
So, to re-cap:
:data turns on data mode which treats the input as an iterator and
matches each atom in the regex against objects returned by the
iterator (must be rewindable? or do we cache when it's not?)
Then, inside the regex we use <^...> to match complex data, as follows:

<^~ ... >       - match the enclosed regex as a string against a single
                  element (so <^~ \d+ > is equivalent to <,> \d+ <,> in
                  the proposal you linked), with :data turned off
<^{ ... }>      - smart-match the return value of the closure against the
                  current element
<^::identifier> - smart-match a type against the current element
<^[...]>        - descend into the current element, which should itself be
                  iterable, and match with :data turned on
<^ name>        - same as <^[<name>]>
This should be powerful enough to match any arbitrarily nested set of
iterable objects. I think it will be particularly useful against parse
trees (and similar structures such as XML/HTML DOMs) and scanner
productions, though users will probably find nearly infinite uses for
it, much as they did for the original regular expressions.
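As a concrete (and entirely hypothetical) sketch of the scanner-production
case, a lexer's token stream might be parsed with these forms like this;
the token shapes and rule names are invented purely for illustration:

    token num  { <^[ 'num' <^::Num> ]> }   # descend into a [ 'num', 5 ]-style token
    token plus { <^[ 'op' '+' ]> }
    rule expr  { <num> [ <plus> <num> ]* }

    [ [ 'num', 5 ], [ 'op', '+' ], [ 'num', 3 ] ] ~~ rx :data /<expr>/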
I agree that being able to parse data structures would be *extremely*
useful. (I think I posted a suggestion like that at one time, though
I didn't propose any syntax.) There is already a way to parse data --
Signatures, but not with the full power that grammars can apply to text.
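(For anyone who hasn't seen it, sub-signature unpacking is the feature I
mean; a minimal sketch, with the sub name and data invented for
illustration:

    sub eval-node([$op, $left, $right]) {
        return $left + $right if $op eq '+';
        return $left * $right if $op eq '*';
    }
    say eval-node([ '+', 5, 7 ]);   # 12

Note that this only unpacks one level as written; handling an arbitrarily
nested tree takes explicit recursion, which is exactly the sort of thing a
data-grammar could give you for free.)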
Data-grammars could do everything that Signatures do, and more, though
it's still worth having special syntax designed to fit the special and
ubiquitous case of sub signatures. Trying to expand signature-syntax
to cover everything grammars can do would probably end up too ugly;
nevertheless, if we had full grammars to build on, I'm sure the Sig-
syntax could be extended quite a lot before it got too cumbersome. It
would also open the way for people to build custom sig-parsers (with
their own special syntax or not).
It also might be worth inventing a whole new syntax design for parsing
and manipulating data structures, but your suggested extensions seem
pretty good to me.
-David
> I agree that being able to parse data structures would be *extremely* useful.
> (I think I posted a suggestion like that at one time, though I didn't
> propose any syntax.) There is already a way to parse data -- Signatures,
> but not with the full power that grammars can apply to text.
I'm certain I'm not the first person to think of it, especially since
the link that Moritz provided came so close.
Just to get a sense of the scope of it, I've been playing around with
the code tonight. Modifying PGE to handle the syntactic changes was
trivial, but of course, it's the semantic shift of allowing the input
to a regex to be a data structure other than a string that will take
some digging around and re-tooling (many assumptions need to be
revisited, like what "pos" means in the context of a multi-level
data-structure match!).
I'll have another go at that tomorrow and see how much work it's
likely to be, but I'm still thinking this is something to push forward
into 6.1, rather than something to hold up any work on 6.0 for.
I suppose if I want to be nice to others, I should come up with a
patch against the STD as well, since there's now an active project
using it to compile code.