Suppose you have a browser (which understands "language" traits) or a word processor (which stores "style" and/or "font" information) that stores string-like things that are more than plain text, either as scalar strings with extra information attached or as objectrefs.
You want to do something like "search for all occurrences of the word 'From:' in a heading style" or "Find all letters 'l' in french text".
How do you write, and how do you code, the rule(s) for that?
I think it could be a rule junction, as C< /all(<french>, 'l')/ > but that's not entirely satisfying since I don't imagine that rule junctions are going to be the most efficient constructs around. (But would a rule junction be a valid way of searching?)
Alternatively, there would need to be some way of inquiring about distributed traits: a trait that wasn't actually applied to every single member of a list, but which was "inferred" by some magic accessor. (IOW, the "string" object defines a special version of the trait accessor method (.AUTOTRAIT anyone?) that knows how to query to see if there is a <french>...</french> tagset surrounding this text, or whatever.)
With that, you could define rules called "<french>" and "</french>" that cleverly look like XML but invoke the rules. Something like:
m« <french>l</french> »
This has the twin virtues of (1) looking cool; and (2) being really self-explanatory. But, how would you code a rule pair like french and /french?
If that's not doable, is there some other way, especially some variable way, of checking for "traits" at the same time you're matching patterns? (I.e., $language instead of <french>)
=Austin
Okay, I'm supposin'. But I'd rather not call them traits, since that
already means two other things right now. Properties is more like...
: You want to do something like "search for all occurrences of the word
: 'From:' in a heading style" or "Find all letters 'l' in french text".
:
: How do you write, and how do you code, the rule(s) for that?
Depends on how you think of the embedded objects.
: I think it could be a rule junction, as C< /all(<french>, 'l')/ >
: but that's not entirely satisfying since I don't imagine that rule
: junctions are going to be the most efficient constructs around. (But
: would a rule junction be a valid way of searching?)
Not written like that. At minimum you'd have to put a colon on the
front of that to make it :all. Except that :all is already taken...
I did get a request once for & to do the opposite of | though.
And one could make an argument that we should reserve :all, :any,
:one and :none for junctional utterances. In which case what we
currently call :all should probably be :every or :exhaustive or
something else guaranteed to be confused with :e. :-)
: Alternatively, there would need to be some way of inquiring about
: distributed traits. That is, a trait that wasn't actually applied to
: every single member of a list, but which was "inferred" by some magic
: accessor. (IOW, the "string" object defines a special version of the
: trait accessor method (.AUTOTRAIT anyone?) that knows how to query to
: see if there is a <french>...</french> tagset surrounding this text,
: or whatever.)
: With that, you could define rules called "<french>" and
: "</french>" that cleverly look like XML but invoke the rules. Something
: like:
: m« <french>l</french> »
: This has the twin virtues of (1) looking cool; and (2) being really
: self-explanatory. But, how would you code a rule pair like french
: and /french?
That...makes my head hurt. And will probably make Perl's head hurt too.
: If that's not doable, is there some other way, especially some
: variable way, of checking for "traits" at the same time you're matching
: patterns? (I.e., $language instead of <french>)
If embedded objects are just considered strange characters, and
characters are just considered strange objects, then the most
straightforward way to get object/character properties with set
operations is through the mechanisms that are already there.
For example, to find a french word using character property sets:
/<<alpha> & <french>>+/;
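As a rough illustration of the character-property idea (not Perl 6, and not how any real rule engine does it), here is a small Python sketch. It assumes a made-up representation where text is a list of (char, properties) pairs; the names tag, runs_with, is_alpha and is_french are all hypothetical helpers invented for this example.

```python
# Hypothetical model: styled text is a list of (char, frozenset_of_properties).
def tag(text, *props):
    """Attach the same property set to every character of `text`."""
    return [(ch, frozenset(props)) for ch in text]

def runs_with(styled, *predicates):
    """Yield (start, end, text) for each maximal run of characters
    for which every predicate holds (the '&' of the property sets)."""
    start = None
    for i, (ch, props) in enumerate(styled):
        ok = all(pred(ch, props) for pred in predicates)
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            yield start, i, "".join(c for c, _ in styled[start:i])
            start = None
    if start is not None:
        yield start, len(styled), "".join(c for c, _ in styled[start:])

# Two "character classes": one looks at the character, one at its properties.
is_alpha = lambda ch, props: ch.isalpha()
is_french = lambda ch, props: "french" in props

doc = tag("the word ") + tag("ville", "french") + tag(" has two l's")
french_words = [text for _, _, text in runs_with(doc, is_alpha, is_french)]
```

With this toy document, french_words holds only the run that is both alphabetic and tagged french; the untagged words on either side are skipped.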
Your specific example is a little more complicated. Though of course,
since "l" is one letter, one could in this particular case write:
/<[l] & <french>>+/;
The general solution, however, is:
/(From\:) <( $1 ~~ /^<headingchar>+$/ )>/
Which seems a bit suboptimal. The proposed & counterpart to | could
help here:
/ From\: & <headingchar>+ /
the point of & being that all its subpatterns have to start and stop
at the same spot, or it's not a match. In the way it was originally
posed to me, it was a bioinformatics problem where you want to say
something like:
/$startseq [ $seqA & $seqB ] $finalseq/
except that that's implying some scanning that the regex engine wouldn't
do by default. You'd have to say something like:
/$startseq [ .*? $seqA .*? & .*? $seqB .*? ] $finalseq/
And now you can see how it would be very easy to abuse & badly in terms
of performance. The above could easily be O(n**4) unless the optimizer
was extremely cagey in factoring out the wildcards into something like:
/$startseq .*? [
    [$seqA .*? & .*? $seqB ] |
    [$seqB .*? & .*? $seqA ]
] .*? $finalseq/
That's still gonna stress the regex engine though. The efficient way
to solve this particular problem, assuming that $finalseq doesn't
match everywhere, is this:
/$startseq (.*?) $finalseq <( $1 ~~ /$seqA/ && $1 ~~ /$seqB/ )>/
It's like ordering your expensive tests after your cheap tests in
if foo() and baz() and bar()
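The cheap-tests-first version above can be sketched in Python roughly as follows. This is only an approximation of the backtracking behaviour, under the assumption that the sequences are ordinary regex strings; find_region and its parameters are hypothetical names, not part of any existing module.

```python
import re

def find_region(text, start, final, required):
    """Find a stretch between a `start` match and a later `final` match
    (nearest `final` first) that contains every pattern in `required`.
    Returns the text in between, or None if no such stretch exists."""
    start_re = re.compile(start)
    final_re = re.compile(final)
    required_res = [re.compile(r) for r in required]
    for s in start_re.finditer(text):
        # Cheap part: locate the delimiters first.
        for f in final_re.finditer(text, s.end()):
            middle = text[s.end():f.start()]
            # Expensive part: only now look inside the captured stretch.
            if all(r.search(middle) for r in required_res):
                return middle
    return None

# Toy data, made up for illustration.
region = find_region("ATGCCCTAAGGGTAA", "ATG", "TAA", ["CCC", "GGG"])
```

Here the first TAA closes a stretch containing only CCC, so the search moves on to the second TAA, much as the regex engine would backtrack into the (.*?) when the assertion fails.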
I suppose the regex compiler could guess that a pattern like
A [ B & C ] D
should be tested
if A and D and [ B & C ]
But that gets blown to smithereens if D relies on a backref to B or C.
So does any implementation that tries to turn [ B & C ] into a one-pass
state machine.
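To make the "start and stop at the same spot" requirement concrete, here is a deliberately naive Python sketch that tries every span and keeps the first one that all subpatterns match in full. The function name conj_search is hypothetical, and the leftmost-longest preference is an assumption of this sketch, not something the proposal specifies. It is brute force, which is rather the point about how easily & could get expensive.

```python
import re

def conj_search(text, *patterns):
    """Return the first (start, end) span that EVERY pattern matches
    exactly, trying leftmost starts first and longer spans before
    shorter ones. Returns None if no span satisfies all patterns."""
    compiled = [re.compile(p) for p in patterns]
    n = len(text)
    for i in range(n + 1):
        for j in range(n, i - 1, -1):
            # fullmatch with pos/endpos forces each pattern to start at i
            # and stop at j, which is the whole meaning of '&' here.
            if all(c.fullmatch(text, i, j) for c in compiled):
                return i, j
    return None

# Roughly / From\: & <headingchar>+ /, with a stand-in character class.
span = conj_search("see From: here", r"From:", r"[A-Za-z:]+")
```

With these inputs, span covers exactly the five characters of "From:", since that is the only stretch both patterns agree on.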
Still, just because a feature can be abused doesn't mean that it
shouldn't go in. There's a lot to be said for being able to write
things like:
[ <ident> & <ascii>+ ]
Now I'm supposing that & binds tighter than | as usual, so the
brackets wouldn't always be necessary:
<ident> & <french>+
|
<ident> & <swahili>+
Larry
Although, of course, that should probably be written:
<ident> & [ <french>+ | <swahili>+ ]
or really, just
<ident> & <<french>|<swahili>>+
That last is likely to be the fastest, since a decent implementation
of character properties should cache swatches of the bitmap like Perl 5
does, or at least memoize something somewhere to keep from having
to recalculate what's french and what's swahili...
Larry
> There's a lot to be said for being able to write
> things like:
>
> [ <ident> & <ascii>+ ]
>
> Now I'm supposing that & binds tighter than | as usual, so the
> brackets wouldn't always be necessary:
>
> <ident> & <french>+
> |
> <ident> & <swahili>+
FWIW, I'm strongly in favour of adding & to rules.
Indeed, if Larry were to give the word, I'd be delighted to add support for it
to the Perl6::Rules module.
Damian
Execute! (I hope that's the right word...)
Larry
> : Indeed, if Larry were to give the word, I'd be delighted to add support for
> : it to the Perl6::Rules module.
>
> Execute! (I hope that's the right word...)
I believe, Captain, the correct word would be: "Engage!"
Data^H^Hmian