I really wasn't being very clear about this. For efficiency we may
need "trie" support (or something like it) to match various strings
in parallel. My point is that it could very well be the case that
character classes are just a specific application of this.
Larry
> I see one other potential gotcha with respect to backtracking and
> closures. In P6, a closure can declare a hypothetical variable
> that is restored only if the closure exits "unsuccessfully". Within
> a rule, an embedded closure is unsuccessful if it is backtracked over.
> But that implies that you can't know whether you have a successful
> return until the entire regex is matched, all the way down, and all the
> way back out the top, or at least out far enough that you know you
> can't backtrack into this closure. Abstractly, the closure doesn't
> return until the entire rest of the match is decided. Internally,
> of course, the closure probably returns as soon as you run into the
> end of it.
Let's get concrete:
rule foo { a $x:=(b*) c }
"abbabc"
So, if I understand Parrot and Perl 6 correctly (heh, fat chance), a
slight modification to the calling convention of the closure that
represents a rule (possibly even a raw .Closure) could add a pad that
the callee is expected to fill in with any hypotheticals defined during
execution. The following would happen in the example above:
store_lex "bb" into hypopad("$x") after "abb"
find "a" and fail the rule, backtracking (clear hypopad("$x"))
store_lex "b" into hypopad("$x") after backtracking over one "b"
find "b" next and fail the rule, backtracking again (clear)
store_lex "b" into hypopad("$x") after second "ab"
find "c" and succeed rule foo, return hypopad
Essentially every close-paren triggers binding, and every back-track
over a close-paren triggers clearing.
Because this is all part of the calling convention for a rule, there's
no difference between a rule "passing" back hypotheticals to its caller
and a sub-rule doing so to the rule which called IT.
Is that workable? Does it address your concern, Larry, or did I miss
your point?
> rule foo { a $x:=(b*) c }
In the rest of my message I acted as if that read:
rule foo { a $x:=(b+) c }
so, we may as well pretend that that's what I meant to say ;-)
Indeed, what I've been working towards in my head is that a rule
would consist of a set of "components" which can nest and can refer to
other rules, and each component would simply keep track of its
current start and end positions of what it matches in the string,
as well as what to do if that component needs to backtrack/continue
because a successive component is unable to achieve its part of the match
(or if we're looking for an C<:exhaustive> set of matches). As Larry
mentioned in a previous conversation, the rules engine will need to
need to remember its state without keeping a large stack (fortunately
there appear to be a number of possible optimizations here).
So, what a rule component really wants to do is to be able to perform
comparisons and tests of substrings in-place, as opposed to extracting
them to perform the match. Opcodes to support that would be most
helpful (they may already be there--I just haven't looked at that part
yet).
And yes, my current state of thinking is that each rule component
is a Parrot object of some sort that knows where it fits in the
rule and can pass information and control flow to/from its neighbors
in trying to get the rule(s) to match the target string.
> : *) find first character of class X in string
> : *) find first character not of class X in string
> We're gonna run into the "what's a character?" issue here, especially
> at higher Unicode levels where what the user things of as a single
> character is really a sequence of codepoints.
> : [...]
> : *) create new class X
> : *) add or subtract character to class X
> : *) create union|intersection|difference of two classes
> Not sure you really need opcodes for those if character classes are
> just real objects with a particular interface.
Indeed, I was thinking that character classes would just be more
instances of rule component objects, and these would have methods
for building unions/intersections/differences. Or, the rules compiler
itself can do much of this when it constructs the character class
components, rather than try to build them dynamically in Parrot.
> (Also, the Perl 6 parser may be based on the notion of a set of hash keys
> being treated a bit like a lex grammar,
Yes, I think this is likely.
> (Also, minor nit, I'm not sure "find" is the right verb here and
> elsewhere. Mostly the regex engine just wants to check the "current"
> location. The rest is control flow.)
Yes, I'm (pardon the pun) finding that advanced "string find" operations
may not be of great importance until we start looking at some special-case
optimizations. I'd say to wait for those opcodes until we find (sorry!)
we need them.
> I see one other potential gotcha with respect to backtracking and
> closures. In P6, a closure can declare a hypothetical variable
> that is restored only if the closure exits "unsuccessfully". [...]
Indeed, but I'm gonna cross that bridge when I get to it.
Pm
Okay, except that hypotheticality is an attribute of a variable's
value, not of the pad it's in. As you wrote it above, $x would refer
to an external variable, which might well be in the outer lexical pad.
You can write $?x instead, which makes it automatically scoped to
the current rule (that is, it lives in the $0 object). But again,
that's largely independent of whether it's hypothetical. The binding
you did implies hypotheticality, but within an embedded closure it
wouldn't be hypothetical unless you said "let". That is,
my $x;
rule foo { a $x:=(b+) c }
is shorthand for something like
my $x;
rule foo { a (b+) { let $x := $1 } c }
: The following would happen in the example above:
:
: store_lex "bb" into hypopad("$x") after "abb"
: find "a" and fail the rule, backtracking (clear hypopad("$x"))
: store_lex "b" into hypopad("$x") after backtracking over one "b"
: find "b" next and fail the rule, backtracking again (clear)
: store_lex "b" into hypopad("$x") after second "ab"
: find "c" and succeed rule foo, return hypopad
:
: Essentially every close-paren triggers binding, and every back-track
: over a close-paren triggers clearing.
Yes, that's essentially correct. My quibble was simply that it may be
hard to keep track of what to clear out in the case of calling a
failure continuation.
: Because this is all part of the calling convention for a rule, there's
: no difference between a rule "passing" back hypotheticals to its caller
: and a sub-rule doing so to the rule which called IT.
Again, hypotheticality is (these days) independent of scope, though
a variable scoped to a rule certainly cannot live longer than its $0.
: Is that workable? Does it address your concern, Larry, or did I miss
: your point?
Well, kind of, but it's the "how" that gets interesting...
Larry
Huh? What do you mean by "semantics"? The only semantics needed are the
minimum necessary to answer the question "is the fred at offset i equal
to the fred X?" (Sorry, not sure if fred is actually character or
codepoint or whatever, and is probably all of them at different levels.)
We also almost certainly need to be able to do character class
comparisons, although if you assume that you can always transcode to
what the regex was compiled with, then you don't even need that --
instead, you need to be able to convert to something like a difference
list of numbered freds. But if we're talking about semantics, then yes
you need the character class manipulation.
Everything else in this list sounds like optimizations to me, and
probably not the right optimizations (I don't think it's possible to
predict what will be useful yet.)
For other things that parrot will be used for, I suspect that the first
3 will be needed.
I'm curious as to how you came up with that list; it seems to imply a
particular way of implementing the grammar engine. I would expect all of
that, barring certain optimizations, to be done directly with existing
pasm instructions.
There will be a need for saving a stack of former values of hypothetical
variables, which can also be done with pasm ops but might interact with
overloaded assignment or something wacky like that.
I mean "What actions does a regular expression engine need to
perform," especially in the face of potentially opaque or really
annoying and pushed off character classes. (Like you get with
Unicode, for example -- regardless of how much Deep Evil Knowledge
you have of it, it's still likely best to push off most character
class handling to the Unicode library, especially if there are
locale-specific overrides of some of the classes in force)
>Everything else in this list sounds like optimizations to me, and
>probably not the right optimizations (I don't think it's possible to
>predict what will be useful yet.)
All I did was run through the current regex engine and pull out what
I saw as its primitives. The grammar bits being added in for the new
grammar engine need some extra functionality, but none of it involves
looking in strings for things.
>I'm curious as to how you came up with that list; it seems to imply a
>particular way of implementing the grammar engine. I would expect all of
>that, barring certain optimizations, to be done directly with existing
>pasm instructions.
Yes, and some of the initial list already has ops to do those bits,
though I fully plan on evil cheating versions for some extra speed.
--
Dan
--------------------------------------it's like this-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
If I recall correctly, someone with the best intentions attempted
to write a clear, object-oriented (but still C/C++ based) regex
engine to replace the existing highly hairy code. As a result it
missed fitting in the cache, and ended up being like two orders
of magnitude slower.
Not arguing for premature optimization, but anyone thinking about
regex might want to look ahead (get it?) to the end game where
regex pretty much has to be within .2x of its current speed.
F.
I don't think we're going to be able to manage doing our matches in
20% of the time of the current regex engine. That's a bit ambitious,
even for me. :)
FWIW, we did some work on this a few years back. (And yes, the fact
that I can say that with a straight face scares me) We managed to run
only about 25% slower than the perl 5 regex engine when using the
basic string ops. This was pre-ICU, though, so the numbers may well
be significantly worse now.
Anyway, that's the whole point of this exercise. I want to know what
primitive operations the regex engine will need. (Or, rather, I've a
pretty good idea and I want to make sure nothing is missed) When I
know I plan on taking all the best accepted standards and practices
of software engineering, throwing them out, and cheating like hell
with those string primitives. If this means having several alternate
class matching schemes (since bitmaps are likely fastest for small
sets like 7/8 bit ASCII and some sort of trie fastest for large ones
for the asian sets and Unicode) then fine. If this means being able
to pin a single string in a fixed location to avoid a single pointer
indirection on each access, fine. If it means knowing that since the
input is US-ASCII or Latin-x characters are 8 bits and there are no
combining characters so we drop to direct byte access, fine.
I'm not so much resigned to cheating as looking forward to it with
relish. (And a bit of catsup, with a side of cole slaw)
> : Let's get concrete:
> :
> : rule foo { a $x:=(b*) c }
> : "abbabc"
> :
> : So, if I understand Parrot and Perl 6 correctly (heh, fat chance), a
> : slight modification to the calling convention of the closure that
> : represents a rule (possibly even a raw .Closure) could add a pad that
> : the callee is expected to fill in with any hypotheticals defined during
> : execution.
>
> Okay, except that hypotheticality is an attribute of a variable's
> value, not of the pad it's in.
Yes, I think I got that part, and perhaps I was being unclear or am
still missing something. Here's what I was saying, a slightly different
way:
As you enter a rule, you establish a new, free-floating pad. It *is*
stored on the current pad stack (so that its variables are available to
the rule and its closures), but, more importantly, it is part of the
rule's state because it is stored in C<$0>
When you bind a hypothetical it goes into this pad.
When you unbind a hypothetical (fail/backtrack) it is deleted from this
pad (its value doesn't just get undef).
When you return from the rule (and this is the key), you return C<$0>,
which, along with other state, contains a reference to this pad (and the
pad, of course contains a circular reference to C<$0>). The caller can
now do one of two things:
* Push this pad onto its stack. Pro: simple and fast
* Copy each variable from this pad in a "smart" way, searching up
the pad stack for a candidate variable to replace, and
defaulting to storing it in the inner-most pad as a new lexical.
I think the second one is the one you are describing (and described in
A5). The first is, IMHO, the cleaner solution, but I'm not suggesting
anything really, just pointing out the options.
My real point is that if you just establish such a free-floating
"hypopad" (sounds like something Dr. McCoy would use) in the rule, then
you get all of the hypothetical/backtracking behavior that you want,
regardless of how the caller integrates the variables with its scope. It
also keeps rules from having to search up through existing scope levels
themselves, keeping their complexity constrained to what they know best:
matching regular expressions and grammars. Perl's calling conventions
manage all of the extra complexity on return, and that's probably where
stack-walking code should go anyway.
> : Essentially every close-paren triggers binding, and every back-track
> : over a close-paren triggers clearing.
>
> Yes, that's essentially correct. My quibble was simply that it may be
> hard to keep track of what to clear out in the case of calling a
> failure continuation.
I'm not sure if that's going to be true or not, as thinking in terms of
failure continuations hurts my brain ;-) Still, I'm 99% sure that what I
describe above puts all of the "what to clear" state in the pad that you
return. Nice and easy.
A side point to Dan: In reading P6&PE, I don't see an op for deleting an
entry from a pad. At least for this, and I think for some other things
that aren't coming to mind right now, it's probably going to be needed.
If it's already there, but just not in P6&PE, cool and thanks! ;-)
I dunno, there are a number of extant cases of languages that
manage to run regexes just as fast as the current regex engine.
The fact that they happen to, well, _embed_ the current regex
engine may be germane...
Although the next regex engine has to deal with the horribly
crufty new perl6 syntax, at this point I'm willing to put down
$50 that after all the boiling what we get at the bottom of the
saucepot looks much like:
regex P0, P1, P2
and not much like 2000 lines of pasm, but I've been wrong
before and will be again.
F.
> A side point to Dan: In reading P6&PE, I don't see an op for deleting an
> entry from a pad.
$P0 = peek_pad
delete $P0["foo"]
Deleting by index/depth is unimplemented and marked as TODO in
classes/scratchpad.pmc
leo
True enough. Oh, don't get me wrong, I think we can go faster than
the perl 5 regex engine. I just don't think we can do in 2 seconds
what takes perl 5 10 seconds... :-P
>Although the next regex engine has to deal with the horribly
>crufty new perl6 syntax, at this point I'm willing to put down
>$50 that after all the boiling what we get at the bottom of the
>saucepot looks much like:
>
>regex P0, P1, P2
>
>and not much like 2000 lines of pasm, but I've been wrong
>before and will be again.
Last time I looked at the perl 5 regex engine (and I'm still
recovering, but the medication's helping) it was just a little
bytecode engine with massive amounts of Nasty Evil piled on top. I
think we can get a nice boost from the JIT or computed goto core,
enough so that a single big bytecode interpreter (parrot) will work
better than two smaller bytecode interpreters (which is what perl 5
w/the regex engine is, more or less)
I know. :)
> Lacking any kind of formal
>specification for it, my general thought is that people will
>generally (but not the biotech or chip design guys) feel OK
>about parrot if their regexes are only 5x slower. After that
>people start asking pointed questions.
If we come in 5x slower, I'll consider it a total and utter failure.
I'm not sure I'll consider anything worse than 5-10% slower than perl
5.005's regex speed acceptable enough to commit.
> > Last time I looked at the perl 5 regex engine (and I'm still
>> recovering, but the medication's helping) it was just a little
>> bytecode engine with massive amounts of Nasty Evil piled on top. I
>> think we can get a nice boost from the JIT or computed goto core,
>> enough so that a single big bytecode interpreter (parrot) will work
>> better than two smaller bytecode interpreters (which is what perl 5
>> w/the regex engine is, more or less)
>
>It'll be interesting to see how much cache coherency plays into
>the picture. The twitchy, beady-eyed fanged little hairy regex
>engine does have the benefit of being able to nestle into modern
>CPU caches, and get 1-5 ns access time rather than 30-120 ns.
>Probably accounts for its evolutionary hardiness.
Yep, that 1 cycle access time does make quite a difference. Still,
there's a lot to be gained by eliminating the bytecode interpreter
loop (if we JIT) or trim a lot of cycles off it (if we just use the
computed goto core) so there's a chunk to be saved there. What'll
kill us is abstract access to strings, which is what I was hoping
this thread could help us kill. ;)
> Although the next regex engine has to deal with the horribly
> crufty new perl6 syntax
Keep in mind that Perl 6 regexen are really just Perl 5 regexen with a
call stack and backtracking control. Absolutely everything else that I
see in P6 is either just a different syntax for the same thing (e.g.
character classes) or unrelated to the actual regex engine itself (e.g.
hypotheticals). There's nothing that I see in this that would slow down
a mundane regexp OTHER than Unicode, and in many respects P5 has already
taken that hit.
Now, that's not to say that:
rule perl6_program { <perl5_program> | <perl6_statement>(*) }
is supposed to run as fast as a Perl 5 regexp, but that's a WHOLE other
beast, and we don't expect it to be as simple as a regexp.
Under the hood, I would expect that P6 regexen will be broken down into
"matching" and "flow control" parts, and handed off to Parrot
differently. While there might or might not be an op for the matching
part, the flow control part is just code (though code with significant
magic, I will admit).
In other words, you might see:
rule { a <b> c }
get broken down into:
rule { {pasm('regexp P0, P1, "a"')} <b> {pasm('regexp P0, P1, "c"')} }
As Dan points out "regexp" might not exist, or (this seems more likely)
might just be a call-back into a tiny regexp compiler that generates
Parrot bytecode for the convenience of languages for which regex is not
a core feature that the compiler would want to get its hands dirty with.
So far, so good.
: When you bind a hypothetical it goes into this pad.
You're still confusing hypotheticality with storage (as I was when I
wrote the Apocalypse in question, so I'm certainly not blaming you :-).
If you bind a hypothetical, you are performing a "let" on that variable,
which is very much like a "temp" (a "local" in P5-ese). That is completely
independent of where the variable is stored.
If you have a variable with a C<?> secondary sigil, as in C<$?x>,
it stored in the rule's pad. The C<?> is functioning as a "my" with
respect to the rule. If there is no C<?>, then either the variable
is declared explicitly with C<my> (or equivalent) in an outer lexical
scope, or it's a package variable (in the absence of strictness).
The location of the variable is completely independent of whether
the variable is hypothetical.
In any case, a variable exists in the pad as soon as the pad is
created, whether it's a lexical pad or a rule pad (which is also
a lexical pad, as it happens--you just don't have to use "my").
In either case, the compiler knows that there's a lexical variable
of the name of either C<$x> or C<$?x>, and the initial pad reflects that.
(If we didn't do that, we'd screw up the closureness of an exception
handler, which in Perl 6 sees the pad for the lexical scope in which
it's embedded, and has to know that $x exists but is undefined even
before it is elaborated.)
: When you unbind a hypothetical (fail/backtrack) it is deleted from this
: pad (its value doesn't just get undef).
Neither of those is true. It must regress to the *previous* value, which
might or might not be undef. But it never disappears from its pad.
: When you return from the rule (and this is the key), you return C<$0>,
: which, along with other state, contains a reference to this pad (and the
: pad, of course contains a circular reference to C<$0>). The caller can
: now do one of two things:
:
: * Push this pad onto its stack. Pro: simple and fast
Which stack is that?
: * Copy each variable from this pad in a "smart" way, searching up
: the pad stack for a candidate variable to replace, and
: defaulting to storing it in the inner-most pad as a new lexical.
No, if the lower rule bound anything outside of its scope, it's
already bound. All the upper rule has to do is decide whether to
bind the lower $0 to some other name in its own upper $0. (It doesn't
have to. One post-A5 change is that rules that remember their subtree
are written <?expr>, while rules that throw away their subtree are
written <ws>.)
: I think the second one is the one you are describing (and described in
: A5). The first is, IMHO, the cleaner solution, but I'm not suggesting
: anything really, just pointing out the options.
There is no separate match stack, and there is no copying. There's
merely the tree of $0 objects, which ends up being your syntax tree
on a successful match. There is a separate mechanism that keeps
track of C<temp> and C<let> variables, which in some sense acts as a stack.
Its behavior must be intimately tied to backtracking. It is largely
independent of the lifetimes of the $0 pads, except insofar as there's
a correlation between backtracking over some variables, and wanting
to blow away the entire $0 pretty soon if you happen to backtrack out
of the whole rule.
: My real point is that if you just establish such a free-floating
: "hypopad" (sounds like something Dr. McCoy would use) in the rule, then
: you get all of the hypothetical/backtracking behavior that you want,
: regardless of how the caller integrates the variables with its scope. It
: also keeps rules from having to search up through existing scope levels
: themselves, keeping their complexity constrained to what they know best:
: matching regular expressions and grammars. Perl's calling conventions
: manage all of the extra complexity on return, and that's probably where
: stack-walking code should go anyway.
A "hypopad" is the wrong granularity for hypotheticality. $2 is usually
more hypothetical than $1 simply because it's based on the $1 hypothesis...
: > : Essentially every close-paren triggers binding, and every back-track
: > : over a close-paren triggers clearing.
: >
: > Yes, that's essentially correct. My quibble was simply that it may be
: > hard to keep track of what to clear out in the case of calling a
: > failure continuation.
:
: I'm not sure if that's going to be true or not, as thinking in terms of
: failure continuations hurts my brain ;-) Still, I'm 99% sure that what I
: describe above puts all of the "what to clear" state in the pad that you
: return. Nice and easy.
Sorry to inhabit your 1% unsureness, but that's precisely where I am.
The C<let> mechanism is independent of pad state, just like C<temp>.
A hypothetical variable is just a temporized variable with conditional
rollback on failure. Nice and easy. :-)
Larry
I do still think that you can do what I suggest, but I realize that it's
not as easy as handing around a single pad, you would actually need to
maintain either a list of pads (outside of the built-in pad stack,
probably inside of C<$0>) or a list of C<$0>s, each with their own pad.
If you do that, there's something really NIFTY that falls out of it:
premature (possibly temporary) exit from a rule results in the
restoration of all hypotheticals to their pre-rule state (because the
pads in which they live are no longer active (and possibly gone)).
Should you resume such a rule (e.g. because you had gone off to handle a
signal or exception), the hypotheticals all pick up their states again
and proceed. This kind of atomic hypothetical-updating / reversion could
be very valuable (and fast!), since exposing the state of hypotheticals
at the moment of a signal or exception doesn't really make a lot of
sense, and could cause some programs to fail in surprising ways.
Thanks for entertaining this brain-detour of mine. I return you to your
regularly scheduled language design without further ado.
> Sorry to inhabit your 1% unsureness, but that's precisely where I am.
Never doubted it ;-)
Rather than that, wouldn't you prefer to make "substring of target
string" the actual target of all these?
>*) exact string compare
>*) find string in string
>*) find first character of class X in string
>*) find first character not of class X in string
>*) find boundary between X and not-X
>*) Find boundary defined by arbitrary code (mainly for word breaks)
Also, I imagine you'll want a backwards/forwards option on some of
them, to support the p6 equivalent of /a.*b/.
--
Chip Salzenberg - a.k.a. - <ch...@pobox.com>
Visualize This Signature
Only if the resulting substring'd be used in the match. Otherwise
you're better off with just offset or offset/length, and defer
substring extraction until the very end, and only if there are
captures.
> >*) exact string compare
>>*) find string in string
>>*) find first character of class X in string
>>*) find first character not of class X in string
>>*) find boundary between X and not-X
>>*) Find boundary defined by arbitrary code (mainly for word breaks)
>
>Also, I imagine you'll want a backwards/forwards option on some of
>them, to support the p6 equivalent of /a.*b/.
Yeah, I'd not thought about backwards searching. These would need
ranges for searching as well, to restrict how much of the string
they'd look through.
D'oh! Okay, I completely misunderstood. Yep, you're right and what I
was thinking of wasn't sufficient to be actually useful.
Yes, yes, yes, this would be far more useful.
Pm
Considering the semantics of m// and especially s/// at the user
level, we'll probably[*] want to take snapshots of dynamic strings
(think P5's "FETCH" or overload '""'), and apply all the pattern
operations to that snapshot. *However*, in the usual case of applying
the pattern to a simple string, we want to avoid the overhead of
making a copy. This may imply an API requirement.
For Topaz, Scalar's interface included a function that would basically
open the Scalar's hood, giving you a Buffer you could manipulate; then
when you were done working with the Buffer, its modifications (if any)
were propagated back down into the Scalar. In the case of a simple
string, the Buffer in question was the actual guts of the string, so
the "propagation" was a nop; for magical Scalars, the Buffer was a
mortal result of a FETCH-analogue, and after the Buffer was done being
manipulated it was STORE-analogued back into the Scalar and destroyed.
Now Parrot could just punt on this by doing all the logic of
snapshotting and copying as part of the implementation of the
language-level feature, not down at the byte code. But it's hard.
From the outside of a PMC, how can you reasonable ask the question
"are you a basic string"? Does that question always *have* an answer?
So we might think of "open up and say 'aaaah'" as a string feature
necessary for efficient regex work with minimal copying.
[*] Unless it's a _feature_ that given tied $a,
($a = "aaa") =~ s/a/b/g
would call STORE four times ("aaa", "baa", "bba", "bbb").
--
Chip Salzenberg - a.k.a. - <ch...@pobox.com>
"Persistence in one opinion has never been considered
a merit in political leaders." --Marcus Tullius Cicero
> For Topaz, Scalar's interface included a function that would basically
> open the Scalar's hood, giving you a Buffer you could manipulate; then
> when you were done working with the Buffer, its modifications (if any)
> were propagated back down into the Scalar.
The get_string() and set_string_native() vtable methods should do the
right thing. The former returns a COWed copy of the string.
If these functions are overloaded they'll implement magic of some kind.
> From the outside of a PMC, how can you reasonable ask the question
> "are you a basic string"? Does that question always *have* an answer?
Well, the question is probably: how lazy are strings? Is it reasonable
to implement regexen in terms of iteration?
> [*] Unless it's a _feature_ that given tied $a,
> ($a = "aaa") =~ s/a/b/g
> would call STORE four times ("aaa", "baa", "bba", "bbb").
I'd expect two stores here. One for the initial setting of the value and
one for the final result of the global subst.
leo
Yeah, we need a bit more information tacked onto PMCs -- whether they
can return a STRING that properly represents them or not. (PMC
strings with pieces in different encodings, or with annotated text,
wouldn't be able to)
The plan here is for PMCs which can provide a STRING to do so, with
the regex code then grovelling over it. On completion we call the
set_string vtable method to put the resulting string back.
PMCs which can't then get iterated over much more slowly, probably
pulling characters out one by one. For these, the example you gave:
>[*] Unless it's a _feature_ that given tied $a,
> ($a = "aaa") =~ s/a/b/g
> would call STORE four times ("aaa", "baa", "bba", "bbb").
would call STORE four times. (Though I think all of perl 5's tied
variables would be able to return a string rep, so you'd get STORE,
FETCH, STORE called_
I think in all cases we'd get a new string out of a PMC, but since
it's a COW version it'd be reasonably quick to do so. OTOH, I can
certainly see the utility of returning the actual underlying string
if we can, to avoid having to copy data. Hrm. Gotta think on how that
should look a bit.
So how many stores do we expect for
($a = "xxx") =~ s/a/b/g
and which of the possible answers would be more useful?
--
"War against Terrorism" is an oxymoron
I think it depends on C<($a = "aaa") =~ s/a/b/g>.
* If the s/// operator stores once after all substitutions,
then having it alway store whether the subst happened
or not makes sense. "One = one STORE; one =~ one STORE."
* If the s/// operator stores once for each match/subst,
then if no matches happen then no STOREs happen.
Minimal stores are also relevant for this oddity:
($a = "aaa") =~ s{a}
{ $x++ ? substr($a,0,1) : 'b' }e;
Now, should that produce "bbb" or "baa"? I favor "baa", because I
think minimal stores are too important to compromise them just for
this bizarre use case.
> According to mar...@kurahaupo.gen.nz:
> > So how many stores do we expect for
> > ($a = "xxx") =~ s/a/b/g
> > and which of the possible answers would be more useful?
>
> I think it depends on C<($a = "aaa") =~ s/a/b/g>.
I would agree with you in general, but since we're generally after speed,
surely we want to allow for optimisations such as "don't store unless
something's changed"; this would also be compatible with the boolean context
value of s///.
-Martin
--
CAUTION: The information contained in this message is consequential and
subject to legacy provenance. If you are the intended recipient you are
hereby notified that reading this message is permitted. If you have not
received this message please notarise the sender and destroy the
originator.
I vote for leaving all of these sorts of cases undefined. Well,
partially defined -- I'd rather we didn't allow ($a = "aaa") =~ s/a/b/g
to turn $a into "gawrsh". At the very least, define the exact number of
output and stores for "strict aka slow mode", but have an optional
optimization flag that explicitly drops those guarantees. It would allow
for more flexibility in implementations.
I don't claim to follow all this talk about "stores", but from previous
Perl experience, your time in general is going to be dominated by
copying characters, not by the number of operations that control it
(unless by "store" you mean storing individual characters). In a few
cases you can do a modification in place, in which case you should,
but those cases are rather rare in real life, and getting rarer with
Unicode, and often end up being tr/// instead of s/// anyway.
So the main thing you have to avoid is copying all the characters
twice. This is not obvious with toy examples, but think in terms
of doing s/TTAGGG// on, say, the human genome. In the absence of a
chunking data structure, the most efficient general substitution on
a straight string is to build the new string once out of bits of the
old string and the substitution, then swap that entire string in as
the new definition of the string. And in an example like
($a = $genome) ~~ s/TTAGGG//
you'd like to capture the idea that there's already a copy being forced
in the context, so it'd be nice to use that copy to do the transformation
without inducing an additional copy.
There are also optimization modes where a single substitution like
$genome ~~ s/TTAGGG//
can be done "mostly in place". If the location to change is in the
front half of the string, you only relocate characters before it.
If the location is in the back half, you only relocate characters
after it.
And of course, if the match never matches, you should never copy anything
(except in the "en passant" case).
I'm probably preaching to the choir here, but I thought this all ought
to be made explicit. There's a reason Perl is the language of choice
for bioinformatics, and we need to be careful not to throw that away.
And indeed, it would be nice to pass those benefits off to other
languages running on Parrot.
Larry
Think about tied values. When does STORE get called, precisely, on a
tied target of s///? It's good to be explicit about this, down at the
C API level, just so we know what to optimize for. The final answer
is probably less interesting than the discussion that illuminates it.
> In an example like
> ($a = $genome) ~~ s/TTAGGG//
> you'd like to capture the idea that there's already a copy being forced
> in the context, so it'd be nice to use that copy to do the transformation
> without inducing an additional copy.
That's true and worth doing, but it's beside my intended point. Mea
culpa: I combined the assignment with the substitution to make the
example a one-liner, and thus muddied it. The s/// is the topic.
> There are also optimization modes where a single substitution like
> $genome ~~ s/TTAGGG//
> can be done "mostly in place".
Indeed, I had that in mind when I designed Topaz's "open the hood"
feature for string Scalars, which I mentioned earlier as a possible
model for optimizing s/// on normal strings while giving well-defined
semantics to s/// on abnormal strings.