if VAR =~ one of a list?

hea...@aracnet.com

unread,

Jun 6, 2006, 7:02:26 PM6/6/06

to

I'm in the process of learning Ruby, and just finished my first script that
is intended to do actual work. I doing so I came across an interesting
question.

How would I efficiently write the following statement?

if my_variable =~ /^some/i || my_variable =~ /^soome/i || my_variable
=~ /^save/i || my_variable =~ /^Catch/i || my_variable =~ /^BLARG/i ||
my_variable =~ /^safe/i || my_variable =~ /^blarg/i || my_variable =~
/^wrest/i ||my_variable =~ /^template/

Zane

Tim Hunter

unread,

Jun 6, 2006, 7:38:59 PM6/6/06

to

Untested, but I believe this will work. What's more interesting is _why_
it works.

case my_variable
when /^some/i, /^soome/i, /^save/i, ...all your other regexps
do whatever
else
do something else
end

Jeffrey Schwab

unread,

Jun 6, 2006, 11:46:49 PM6/6/06

to

Bit Twiddler

unread,

Jun 7, 2006, 1:54:35 PM6/7/06

to

Why is the "?:" required?

Thanks,
BT

Jeffrey Schwab

unread,

Jun 7, 2006, 9:16:11 PM6/7/06

to

Because the OP asked that the statement be written "efficiently." Text
between parentheses ordinarily is captured, i.e. stored in a special
variable, after each match. The ?: tells the regular expression parser
that the parentheses are being used only for grouping, and thereby
avoids the overhead of capturing text.

Bit Twiddler

unread,

Jun 8, 2006, 9:53:10 AM6/8/06

to

"Jeffrey Schwab" <je...@schwabcenter.com> wrote in message
news:vTKhg.15710$Qg.1...@tornado.southeast.rr.com...

Ahh... Thanks!!

Robert Klemme

unread,

Jun 8, 2006, 10:36:51 AM6/8/06

to

When talking about efficiency then the pattern can be made even better
(manual, probably incomplete optimization):

I.e. build a Trie and convert that into a regexp.

Cheers

robert

Jeffrey Schwab

unread,

Jun 12, 2006, 11:02:42 AM6/12/06

to

True, since Ruby has only an NFA (rather than a DFA) regex engine. I
find this equivalent code a little bit more readable, though:

Robert Klemme

unread,

Jun 12, 2006, 3:51:40 PM6/12/06

to

Probably. But readable != efficient. There's definitively more
backtracking in this RX than in the other one.

Cheers

robert

Jeffrey Schwab

unread,

Jun 12, 2006, 7:58:25 PM6/12/06

to

Where? I don't see it. If anything, it looks like there is less
potential backtracking, since the e after the first alternation is not
duplicated.

Robert Klemme

unread,

Jun 13, 2006, 7:25:39 AM6/13/06

to

Every "?", "*", "+" and "|" can cause backtracking. The plain tree
approach (my RX) immediately fails on the first character of every
branch if it doesn't match while IMHO your RX needs to test more
characters before it can fail. That's where the backtracking occurs.

Do you know the regexp coach? It makes some aspects of RX more visible
- unfortunately I could not find a way that highlights backtracking
but the tree view is quite informative.

http://www.weitz.de/regex-coach/

If you test both RX against "soomt" on the "Step" tab you'll notice that
you need 16 clicks on "next" for my RX and 18 for yours until the match
finally failed.

Kind regards

robert

Jeffrey Schwab

unread,

Jun 13, 2006, 9:10:59 AM6/13/06

to

> Do you know the regexp coach? It makes some aspects of RX more visible

> - unfortunately I could not find a way that highlights backtracking
> but the tree view is quite informative.
>
> http://www.weitz.de/regex-coach/
>
> If you test both RX against "soomt" on the "Step" tab you'll notice that
> you need 16 clicks on "next" for my RX and 18 for yours until the match
> finally failed.

Thanks, I didn't know about that site. I'll check it out when I have a
little more time. I don't think you're counting backtracks, though; I
count 3 backtracked characters for either regex applied to "soomt": the
two o's and the m, since both patterns fail immediately at the t. In
fact, I can't think of any case off the top of my head for which either
regex does more backtracking than the other.

Robert Klemme

unread,

Jun 13, 2006, 9:42:01 AM6/13/06

to

If you watch for the sequence currently matched you'll see this

s
so
s
<empty>
...

s
so
soo
soom
so
s
<empty>
...

I'd say, more backtracking for the second RX.

I think I've also seen a RX engine that can be instrumented with
callbacks somewhere - that would also

Kind regards

robert

Jeffrey Schwab

unread,

Jun 13, 2006, 11:17:22 AM6/13/06

to

Robert Klemme wrote:

Jeffrey Schwab wrote:

Robert Klemme wrote:

> http://www.weitz.de/regex-coach/
>
> If you test both RX against "soomt" on the "Step" tab you'll notice
> that you need 16 clicks on "next" for my RX and 18 for yours until
> the match finally failed.

> If you watch for the sequence currently matched you'll see this

That's because the tool isn't showing you that it tried to match "me",
failed, matched "om", failed at the "e", and backtracked the second "o"
and the "m". It still happened; it's just that the engine tried to
match the "ome" sequence all at once, so it considers the failure
atomic, and ignores the fact that its string-matching loop got to "e"
before realizing there was no match. Most DFAs probably use the same
technique, so your regex will run faster on those, and probably require
less memory too.

> <empty>
> ....

> ....

>
> I'd say, more backtracking for the second RX.

The same characters were backtracked. Which pattern is more efficient
will depend on the specifics of the regex engine. We could always try
timeit, since we're in Ruby...

Robert Klemme

unread,

Jun 13, 2006, 11:45:20 AM6/13/06

to

But with a different number of branches tested.

> Which pattern is more efficient
> will depend on the specifics of the regex engine.

Well, this statement is never wrong. In this case we are talking about
a NFA and I don't see any optimization a RX engine could employ. Of
course I may be missing something but I'd say the "tree" version (below)
is generally better on an NFA.

> We could always try
> timeit, since we're in Ruby...

17:39:31 [Temp]: ./rx-bm.rb
Rehearsal ----------------------------------------
tree 9.485000 0.000000 9.485000 ( 9.534000)
long 9.984000 0.000000 9.984000 ( 9.858000)
------------------------------ total: 19.469000sec

user system total real
tree 9.516000 0.000000 9.516000 ( 9.545000)
long 9.765000 0.000000 9.765000 ( 9.797000)
17:40:24 [Temp]: ruby -e 'p 9.765000 / 9.516000'
1.02616645649433

Kind regards

robert

rx-bm.rb