$sql = "select * from a where b union select * from c where d";
my $nonunion = "[^u]|u[^n]|un[^i]|uni[^o]|unio[^n]";
my (@subsqls) = ($sql =~ m"((?:$nonunion)*");
guaranteeing that the subsqls have all text up to, but not including the string
"union".
I suppose I could say:
rule nonunion { (.*) :: { fail if ($1 =~ m"union$"); } }
although that seems awful slow, and I suppose I that I could do the same thing
in perl6 as I did in perl5, although that gets ugly if you need to combine
matching strings without "union" in them with, say parens:
rule parens { \* [ <-[()]> + : | <self> ]* \) }
rule non_union_non_parens
{
[ < -[()u] > |
u < -[()n] > |
un < -[()i] > |
uni < -[()o] > |
unio < -[()n] >
]
}
my (@subsqls) = ($sql =~ m" ([ <non_union_non_parens> | <parens> ]*) ");
And finally, I suppose I could write a sql grammar (which for this application,
and most) is definitely overkill. So I guess I'd like something shorter,
something where you could say:
< -["union"] >
or
< -["union"\(\)] >
or
< -["union""select"\(\)] >
a generic negative, multi-byte string matching mechanism. Any thoughts?
Am I missing something already present or otherwise obvious?
Ed
> a generic negative, multi-byte string matching mechanism. Any thoughts?
> Am I missing something already present or otherwise obvious?
Maybe I'm misundertanding the question, but I think you want negative
lookahead:
Perl 5: /(.*)(?!>union)/
Perl 6: /(.*) <!before: union>/
Luke
no, that doesn't work, because of the way regexes operate. The '.*' captures
everything, and since the string after everything (ie: the end of the string)
doesn't match 'union', the regex succeeds without backtracking. Try it:
perl -e ' $a = "this has the string union in it"; my ($b) = ($a =~ m"(.*)(?!>union)"); print $b;'
prints:
this has the string union in it
not 'this has the string'.
Ed
I think what you want is just a negated assertion:
/<!'union'>+/
Although I don't know what that means exactly. Does it match 5
characters at a time that aren't "union" or does it match one
character at a time as long as the string "union" isn't matched at
that point?
-Scott
--
Jonathan Scott Duff
du...@cbi.tamucc.edu
That matches more than one thing that is not the string "union".
"u" is not the string "union"; "n" is not the string "union"...
I think /(.*) <commit> <!{ $1 =~ rx{union} }>/ may do it.
--
There is no distinction between any AI program and some existent game.
What's wrong with: ?
rule getstuffbeforeunion { (.*?) union | (.*) }
"a union" => "a "
"b" => "b"
Am I missing something here?
Mike Lambert
hmm... well, it works, but its not very efficient. It basically
scans the whole string to the end to see if there is a "union" string, and
then backtracks to take the alternative. And hence, its not very scalable.
It also doesn't 'complexify' very well.
Suppose you had a long string of text, and you wanted to 'harden' your regex
against the substring union appearing in double-quoted strings, single-quoted
strings, etc. etc, without writing a sql parser. I just don't see how to do this
with ? - I would do something like (taking a page from Mr. Friedl's book ) -
rule regex_matching_sql
{
[
<-[u()"']>+ : |
<parens> : |
<double_string> : |
<single_string> : |
<non_union>
]*
}
rule parens
{
\(
[
<-["'()]>+ : |
<double_string> : |
<single_string> : |
<self>
]*
\)
}
rule single_string
{
\' [ <-[\'\\]>+ : | \.\' ]* \'
}
rule double_string
{
\" [ <-[\"\\]>+ : | \.\" ]* \"
}
rule non_union { [ u < - ['"()n] > | un ... | uni ... | unio ... | u$ ] * }
Of course I could also be missing something, but I just don't see how to do this
with .*?.
Ed
(ps:
As for:
/(.*) <commit> <!{ $1 =~ rx{union} }>/
I'm not sure how that works; and whether or not its very 'complexifiable'
(as per above) . If it does a match against every single substring (take all
characters, look for union, if it exists, roll back a character, do
the same thing, etc. etc. etc.) then this isn't good enough. The non_union
rule listed above is about as efficient as it can get; it does no backtracking,
and it keeps the common matches up front so they match first without
alternation.
)
> > > all text up to, but not including the string "union".
> >
> > rule getstuffbeforeunion { (.*?) union | (.*) }
> >
> > "a union" => "a "
> > "b" => "b"
>
> hmm... well, it works, but its not very efficient. It basically
> scans the whole string to the end to see if there is a "union" string, and
> then backtracks to take the alternative. And hence, its not very scalable.
> It also doesn't 'complexify' very well.
What about
Perl 5: /(.*?)(?:union|$)/
Perl 6: /(.*?) [union | $$]/
or if you want to exlude 'union' from match
Perl 5: /(.*?)(?=union|$)/
Perl 6: /(.*?) [<after: union> | $$]/
IMHO those should scan string one char at a time until 'union' or end-
of-string, which is optimal solution.
--
Markus Laire 'malaire' <markus...@nic.fi>
that's exceedingly slow, at least by my benchmark. So far, I've got 4
possibilities:
my $regex1 = qr{(?:(?!union).)*}sx;
my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;
my $regex3 = qr{(?:[^u]+|(?!union).)*}sx;
my $regex4 = qr{(.*?)(?=union|$)}sx;
timethese
(
100000,
{
'questionbang' => sub { ($line =~ m"($regex1)"); },
'questionbang2' => sub { ($line =~ m"($regex3)"); },
'alternation' => sub { ($line =~ m"($regex2)"); }
'nongreedy' => sub { ($line =~ m"($regex4)"); },
}
);
which come out:
alternation: 8 wallclock secs ( 7.71 usr + 0.00 sys = 7.71 CPU) @ 12970.17/s (n=100000)
questionbang: 17 wallclock secs (16.05 usr + 0.00 sys = 16.05 CPU) @ 6230.53/s (n=100000)
questionbang2: 8 wallclock secs ( 7.74 usr + 0.00 sys = 7.74 CPU) @ 12919.90/s (n=100000)
nongreedy: 41 wallclock secs (41.74 usr + 0.00 sys = 41.74 CPU) @ 2395.78/s (n=100000)
So yes, a form can be constructed out of ?! which is of approximately equal
speed to the alternation.
However, in straight C, the corresponding time is:
2.31u 0.02s 0:02.37 98.3%
which tells me that a lot of optimisation could be made with a generic
mechanism for (non)matching multi-byte character classes. The problem has
to be dealt with anyways when considering unicode... And which form would people
rather type:
(<-[^u]>+|(?!union).)*
or
<-[^'union']>*
I'd say the second scores over the first in intuition, if nothing else...
Ed
> On 1 Oct 2002 at 18:47, es...@rama.comp.pge.com wrote:
>
> > > > all text up to, but not including the string "union".
How about (Perl6)
/(.*?) union {$pos -= length('union');}/
This gets everything up to and including the first instance of 'union',
then gets rid of the bit at the end that we don't want. The capturing
parentheses ensure that we return the part of the string we want, and we
manually reset $pos to the correct position. This is easy to understand,
and very extensible.
Joe Gottman
in perl5, I'd tend to do
m/(?:(?!union).)*/is
or to capture
m/((?:(?!union).)*)/is
I suppose you could use union\b instead of union if you wanted allow
'unions' but disallow 'union'. The general idea is "gobble up each
character that isn't the start of 'union' (or union\b as you prefer).
In my unpracticed perl6, I *think* that gives you
noncapturing:
rx:i/[<!before union>.]*/
capturing:
rx:i/([<!before union>.]*)/
....but better perl6ers than I may spot translation issues there. The
same \b note would apply here (if it's still called \b in perl6 :)
Also, I haven't benchmarked this against the other perl5 solutions
posted here, so this could be horribly inneficient (in perl5, but maybe
per6 will like my way better ;)
-matt
yeah, I'm not arguing that there isn't a solution available, just that the
solution is convoluted and somewhat un-intuitive (witness the many people
who write (.*)(?!union)).
Anyways, to make that perform with any speed you need to say:
((?:[^u]+|(?!union).*))
which is uglier than sin...
Ed (Peschko)