Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

exegesis 5 question: matching negative, multi-byte strings

5 views
Skip to first unread message

es...@rama.comp.pge.com

unread,
Oct 1, 2002, 3:27:24 PM10/1/02
to perl6-l...@perl.org
I was wondering what the favored syntax in perl6 would be to match negative
multi-byte strings. In perl 5:

$sql = "select * from a where b union select * from c where d";

my $nonunion = "[^u]|u[^n]|un[^i]|uni[^o]|unio[^n]";
my (@subsqls) = ($sql =~ m"((?:$nonunion)*");

guaranteeing that the subsqls have all text up to, but not including the string
"union".

I suppose I could say:

rule nonunion { (.*) :: { fail if ($1 =~ m"union$"); } }

although that seems awful slow, and I suppose I that I could do the same thing
in perl6 as I did in perl5, although that gets ugly if you need to combine
matching strings without "union" in them with, say parens:

rule parens { \* [ <-[()]> + : | <self> ]* \) }
rule non_union_non_parens
{
[ < -[()u] > |
u < -[()n] > |
un < -[()i] > |
uni < -[()o] > |
unio < -[()n] >
]
}

my (@subsqls) = ($sql =~ m" ([ <non_union_non_parens> | <parens> ]*) ");

And finally, I suppose I could write a sql grammar (which for this application,
and most) is definitely overkill. So I guess I'd like something shorter,
something where you could say:

< -["union"] >

or

< -["union"\(\)] >

or

< -["union""select"\(\)] >

a generic negative, multi-byte string matching mechanism. Any thoughts?
Am I missing something already present or otherwise obvious?

Ed

Luke Palmer

unread,
Oct 1, 2002, 3:24:45 PM10/1/02
to es...@rama.comp.pge.com, perl6-l...@perl.org

> [Negative matching]

> a generic negative, multi-byte string matching mechanism. Any thoughts?
> Am I missing something already present or otherwise obvious?

Maybe I'm misundertanding the question, but I think you want negative
lookahead:

Perl 5: /(.*)(?!>union)/
Perl 6: /(.*) <!before: union>/

Luke

es...@rama.comp.pge.com

unread,
Oct 1, 2002, 3:47:24 PM10/1/02
to Luke Palmer, perl6-l...@perl.org

no, that doesn't work, because of the way regexes operate. The '.*' captures
everything, and since the string after everything (ie: the end of the string)
doesn't match 'union', the regex succeeds without backtracking. Try it:

perl -e ' $a = "this has the string union in it"; my ($b) = ($a =~ m"(.*)(?!>union)"); print $b;'

prints:

this has the string union in it

not 'this has the string'.

Ed

Jonathan Scott Duff

unread,
Oct 1, 2002, 4:17:03 PM10/1/02
to es...@rama.comp.pge.com, Luke Palmer, perl6-l...@perl.org
On Tue, Oct 01, 2002 at 12:47:24PM -0700, es...@rama.comp.pge.com wrote:
> On Tue, Oct 01, 2002 at 01:24:45PM -0600, Luke Palmer wrote:
> >
> > > [Negative matching]
> >
> > > a generic negative, multi-byte string matching mechanism. Any thoughts?
> > > Am I missing something already present or otherwise obvious?
> >
> > Maybe I'm misundertanding the question, but I think you want negative
> > lookahead:
> >
> > Perl 5: /(.*)(?!>union)/
> > Perl 6: /(.*) <!before: union>/
> >
> > Luke
>
> no, that doesn't work, because of the way regexes operate. The '.*' captures
> everything, and since the string after everything (ie: the end of the string)
> doesn't match 'union', the regex succeeds without backtracking. Try it:

I think what you want is just a negated assertion:

/<!'union'>+/

Although I don't know what that means exactly. Does it match 5
characters at a time that aren't "union" or does it match one
character at a time as long as the string "union" isn't matched at
that point?

-Scott
--
Jonathan Scott Duff
du...@cbi.tamucc.edu

Simon Cozens

unread,
Oct 1, 2002, 4:34:40 PM10/1/02
to perl6-l...@perl.org
du...@cbi.tamucc.edu (Jonathan Scott Duff) writes:
> I think what you want is just a negated assertion:
>
> /<!'union'>+/
>
> Although I don't know what that means exactly.

That matches more than one thing that is not the string "union".
"u" is not the string "union"; "n" is not the string "union"...

I think /(.*) <commit> <!{ $1 =~ rx{union} }>/ may do it.

--
There is no distinction between any AI program and some existent game.

Mike Lambert

unread,
Oct 1, 2002, 6:32:07 PM10/1/02
to es...@rama.comp.pge.com, perl6-l...@perl.org
> guaranteeing that the subsqls have all text up to, but not including the string
> "union".
>
> I suppose I could say:
>
> rule nonunion { (.*) :: { fail if ($1 =~ m"union$"); } }

What's wrong with: ?

rule getstuffbeforeunion { (.*?) union | (.*) }

"a union" => "a "
"b" => "b"

Am I missing something here?

Mike Lambert


es...@rama.comp.pge.com

unread,
Oct 1, 2002, 9:47:25 PM10/1/02
to Mike Lambert, perl6-l...@perl.org

hmm... well, it works, but its not very efficient. It basically
scans the whole string to the end to see if there is a "union" string, and
then backtracks to take the alternative. And hence, its not very scalable.
It also doesn't 'complexify' very well.

Suppose you had a long string of text, and you wanted to 'harden' your regex
against the substring union appearing in double-quoted strings, single-quoted
strings, etc. etc, without writing a sql parser. I just don't see how to do this
with ? - I would do something like (taking a page from Mr. Friedl's book ) -

rule regex_matching_sql
{
[
<-[u()"']>+ : |
<parens> : |
<double_string> : |
<single_string> : |
<non_union>
]*
}

rule parens
{
\(
[
<-["'()]>+ : |
<double_string> : |
<single_string> : |
<self>
]*
\)
}

rule single_string
{
\' [ <-[\'\\]>+ : | \.\' ]* \'
}

rule double_string
{
\" [ <-[\"\\]>+ : | \.\" ]* \"
}

rule non_union { [ u < - ['"()n] > | un ... | uni ... | unio ... | u$ ] * }

Of course I could also be missing something, but I just don't see how to do this
with .*?.

Ed

(ps:
As for:

/(.*) <commit> <!{ $1 =~ rx{union} }>/

I'm not sure how that works; and whether or not its very 'complexifiable'
(as per above) . If it does a match against every single substring (take all
characters, look for union, if it exists, roll back a character, do
the same thing, etc. etc. etc.) then this isn't good enough. The non_union
rule listed above is about as efficient as it can get; it does no backtracking,
and it keeps the common matches up front so they match first without
alternation.
)

Markus Laire

unread,
Oct 2, 2002, 3:39:17 AM10/2/02
to es...@rama.comp.pge.com, perl6-l...@perl.org, Mike Lambert
On 1 Oct 2002 at 18:47, es...@rama.comp.pge.com wrote:

> > > all text up to, but not including the string "union".
> >

> > rule getstuffbeforeunion { (.*?) union | (.*) }
> >
> > "a union" => "a "
> > "b" => "b"
>

> hmm... well, it works, but its not very efficient. It basically
> scans the whole string to the end to see if there is a "union" string, and
> then backtracks to take the alternative. And hence, its not very scalable.
> It also doesn't 'complexify' very well.

What about

Perl 5: /(.*?)(?:union|$)/
Perl 6: /(.*?) [union | $$]/

or if you want to exlude 'union' from match

Perl 5: /(.*?)(?=union|$)/
Perl 6: /(.*?) [<after: union> | $$]/

IMHO those should scan string one char at a time until 'union' or end-
of-string, which is optimal solution.

--
Markus Laire 'malaire' <markus...@nic.fi>


es...@rama.comp.pge.com

unread,
Oct 2, 2002, 2:29:56 PM10/2/02
to Markus Laire, perl6-l...@perl.org
On Wed, Oct 02, 2002 at 10:39:17AM +0300, Markus Laire wrote:
> On 1 Oct 2002 at 18:47, es...@rama.comp.pge.com wrote:
>
> > > > all text up to, but not including the string "union".
> > >
> > > rule getstuffbeforeunion { (.*?) union | (.*) }
> > >
> > > "a union" => "a "
> > > "b" => "b"
> >
> > hmm... well, it works, but its not very efficient. It basically
> > scans the whole string to the end to see if there is a "union" string, and
> > then backtracks to take the alternative. And hence, its not very scalable.
> > It also doesn't 'complexify' very well.
>
> What about
>
> Perl 5: /(.*?)(?:union|$)/
> Perl 6: /(.*?) [union | $$]/
>
> or if you want to exlude 'union' from match
>
> Perl 5: /(.*?)(?=union|$)/
> Perl 6: /(.*?) [<after: union> | $$]/
>

that's exceedingly slow, at least by my benchmark. So far, I've got 4
possibilities:

my $regex1 = qr{(?:(?!union).)*}sx;
my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;
my $regex3 = qr{(?:[^u]+|(?!union).)*}sx;
my $regex4 = qr{(.*?)(?=union|$)}sx;

timethese
(
100000,
{
'questionbang' => sub { ($line =~ m"($regex1)"); },
'questionbang2' => sub { ($line =~ m"($regex3)"); },
'alternation' => sub { ($line =~ m"($regex2)"); }
'nongreedy' => sub { ($line =~ m"($regex4)"); },
}
);


which come out:

alternation: 8 wallclock secs ( 7.71 usr + 0.00 sys = 7.71 CPU) @ 12970.17/s (n=100000)
questionbang: 17 wallclock secs (16.05 usr + 0.00 sys = 16.05 CPU) @ 6230.53/s (n=100000)
questionbang2: 8 wallclock secs ( 7.74 usr + 0.00 sys = 7.74 CPU) @ 12919.90/s (n=100000)
nongreedy: 41 wallclock secs (41.74 usr + 0.00 sys = 41.74 CPU) @ 2395.78/s (n=100000)


So yes, a form can be constructed out of ?! which is of approximately equal
speed to the alternation.

However, in straight C, the corresponding time is:

2.31u 0.02s 0:02.37 98.3%

which tells me that a lot of optimisation could be made with a generic
mechanism for (non)matching multi-byte character classes. The problem has
to be dealt with anyways when considering unicode... And which form would people
rather type:

(<-[^u]>+|(?!union).)*

or
<-[^'union']>*

I'd say the second scores over the first in intuition, if nothing else...

Ed

Joe Gottman

unread,
Oct 2, 2002, 8:16:28 PM10/2/02
to Perl6

> On 1 Oct 2002 at 18:47, es...@rama.comp.pge.com wrote:
>
> > > > all text up to, but not including the string "union".

How about (Perl6)

/(.*?) union {$pos -= length('union');}/

This gets everything up to and including the first instance of 'union',
then gets rid of the bit at the end that we don't want. The capturing
parentheses ensure that we return the part of the string we want, and we
manually reset $pos to the correct position. This is easy to understand,
and very extensible.

Joe Gottman


M Perl6lang 2002100700

unread,
Oct 7, 2002, 8:11:08 AM10/7/02
to es...@rama.comp.pge.com, perl6-l...@perl.org
> match negative multi-byte strings


in perl5, I'd tend to do

m/(?:(?!union).)*/is

or to capture

m/((?:(?!union).)*)/is

I suppose you could use union\b instead of union if you wanted allow
'unions' but disallow 'union'. The general idea is "gobble up each
character that isn't the start of 'union' (or union\b as you prefer).

In my unpracticed perl6, I *think* that gives you

noncapturing:
rx:i/[<!before union>.]*/

capturing:
rx:i/([<!before union>.]*)/

....but better perl6ers than I may spot translation issues there. The
same \b note would apply here (if it's still called \b in perl6 :)

Also, I haven't benchmarked this against the other perl5 solutions
posted here, so this could be horribly inneficient (in perl5, but maybe
per6 will like my way better ;)

-matt

es...@rama.comp.pge.com

unread,
Oct 7, 2002, 1:35:24 PM10/7/02
to m_perl6lang...@wickline.org, perl6-l...@perl.org
On Mon, Oct 07, 2002 at 07:11:08AM -0500, m_perl6lang...@wickline.org wrote:
> > match negative multi-byte strings
>
>
> in perl5, I'd tend to do
>
> m/(?:(?!union).)*/is
>
> or to capture
>
> m/((?:(?!union).)*)/is

yeah, I'm not arguing that there isn't a solution available, just that the
solution is convoluted and somewhat un-intuitive (witness the many people
who write (.*)(?!union)).

Anyways, to make that perform with any speed you need to say:

((?:[^u]+|(?!union).*))

which is uglier than sin...

Ed (Peschko)

0 new messages