
tcl 8.2 regexp not doing non-greedy matching correctly


Mark Baldwin

Sep 20, 1999, 3:00:00 AM
I am building a class to parse batch command line flags and arguments and
wanted to use the new non-greedy regular expression syntax. I am using
prowish 3 (aka tcl 8.2). Non-greedy re's seem to work as long as there are
no greedy re's in the expression. Under perl 5 non-greedy re's always work
in a non-greedy fashion as I expected, even with greedy ones around them.
The expression (ss in the examples) is actually being generated in a loop
for each flag.

The test in perl gives what I feel is the correct answer while the tcl one
does not. test2.itcl is an attempt to get the whole expression to be
non-greedy. It does what I would expect (but not quite what I want). I have
tried many variations. Does anyone know if this is a "bug" or a "feature"
and how to get the same answer perl gives?

=======test.itcl=========
set tl "Plot Z VM -window 4 -volume 1 -segment 2"
puts "Parsing $tl"
set ss {(?:-w|-window)\s+(.+?)(\s+(?:-w|-window)\s|\s+-volume\s|\s+-segment\s|\s+-linetype\s|\s+-symboltype\s|\s+-linecolor\s|$)}
#puts "ss: $ss"
set found [regexp -- $ss $tl match a1 a2]
puts "Found: $found\n1: |$a1| 2: |$a2|\n"

=======test.pl=========
$tl = "Plot Z VM -window 4 -volume 1 -segment 2";
print "Parsing $tl\n";
$ss = "(?:-w|-window)\\s+(.+?)(\\s+(?:-w|-window)\\s|\\s+-volume\\s|\\s+-segment\\s|\\s+-linetype\\s|\\s+-symboltype\\s|\\s+-linecolor\\s|\$)";
print "ss: $ss\n";
$found = ($tl=~/$ss/);
print "Found: $found\n1: |$1| 2: |$2|\n\n";

=====================

=======test2.itcl=========
set tl "Plot Z VM -window 4 -volume 1 -linetype 2 4 6 3 3 3 6 4 -segment 2"
puts "Parsing $tl"
set ss {(?:-w|-window)+?\s+?(.+?)(\s(?:-w|-window)+?\s|\s+-volume\s|\s+-segment\s|\s+-linetype\s|\s+-symboltype\s|\s+-linecolor\s|$)}
#puts "ss: $ss"
set found [regexp -- $ss $tl match a1 a2]
puts "Found: $found\n1: |$a1| 2: |$a2|\n"

=====================

The logic, for those interested, is to look for each flag on the command
line one at a time and get any and everything for that flag to the next
valid flag (except leading and trailing white-spaces) to process. Since a
user may enter some flags and not others, and in any order, my re is
somewhat complicated. This is for a batch entry to a GUI system developed
in itk, "c", "c++" and FORTRAN.


Any wisdom would be appreciated.
Mark E. Baldwin
Lead Information Engineer

PS: I could abandon regexp, make a list and process it. This would take
much more work and be less "clean". Not to mention that I may want regexp
to work later.

Jeffrey Hobbs

Sep 20, 1999, 3:00:00 AM
to Mark Baldwin
Mark Baldwin wrote:
> I am building a class to parse batch command line flags and arguments and
> wanted to use the new non-greedy regular expression syntax. I am using
> prowish 3 (aka tcl 8.2). Non-greedy re's seem to work as long as there are
> no greedy re's in the expression. Under perl 5 non-greedy re's always work
> in a non-greedy fashion as I expected, even with greedy ones around it.
> The expression (ss in the examples) is actually being generated in a loop
> for each flag.

Without going in-depth into your regexps, the above statement is pretty
much true. Non-greedy and greedy don't mix well. I thought this was a
bug, but someone corrected me about this being a "quirky feature". You
should check with Henry Spencer to see if he has refined this case.
You'll notice he's answered several regexp questions in the past couple
of days here (he's just given us, at Scriptics, a new engine that we
might be able to get into 8.2.1).

--
Jeffrey Hobbs The Tcl Guy
jeffrey.hobbs at scriptics.com Scriptics Corp.

Henry Spencer

Sep 21, 1999, 3:00:00 AM
In article <7s6esh$7jq$1...@bgtnsc01.worldnet.att.net>,
Mark Baldwin <mark.e....@worldnet.att.net> wrote:
>...Does anyone know if this is a "bug" or a "feature" and how to
>get the same answer perl gives?

Unfortunately, the answer is that to get the same answer Perl gives, you
have to use Perl's exact regexp implementation.

It is very difficult to come up with an entirely satisfactory definition
of the behavior of mixed-greediness regular expressions. Perl doesn't
try: the Perl "specification" is a description of the implementation, an
inherently low-performance approach involving trying one match at a time.
This is unsatisfactory for a number of reasons, not least being that it
takes several pages of text merely to describe it. (That implementation
and its description are distant, mutated descendants of one of my earlier
regexp packages, so I share some of the blame for this.)

When all quantifiers are greedy, the Tcl 8.2 regexp matches the longest
possible match (as specified in the POSIX standard's regular-expression
definition). When all are non-greedy, it matches the shortest possible
match. Neither of these desirable statements is true of Perl.

The trouble is that it is very, very hard to write a generalization of
those statements which covers mixed-greediness regular expressions -- a
proper, implementation-independent definition of what mixed-greediness
regular expressions *should* match -- and makes them do "what people
expect". I've tried. I'm still trying. No luck so far.

The rules in the Tcl 8.2 regexp, which basically give the whole regexp a
long/short preference based on its subexpressions, are the best I've come
up with so far. The code implements them accurately. I agree that they
fall short of what's really wanted. It's trickier than it looks.
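
The all-greedy and all-non-greedy statements above, and the whole-regexp long/short preference, can be illustrated with a minimal Tcl sketch. It assumes the rule as documented for later releases of this engine, where the first quantifier that has a preference sets the preference of the whole expression; 8.2 itself may differ in detail. The sample strings are made up for illustration and do not come from the thread.

```tcl
# Sample strings are illustrative only, not taken from the thread.
set tags "<a> <b>"
set s    "XY1234Z"

# All quantifiers greedy: the longest possible match is chosen.
regexp {<.+>} $tags longest

# All quantifiers non-greedy: the shortest possible match is chosen.
regexp {<.+?>} $tags shortest

# First quantifier (Y*) is greedy: the whole regexp prefers a long match,
# so the digits group takes as many digits as its bound allows.
regexp {Y*([0-9]{1,3})} $s wholeLong digitsLong

# First quantifier (Y*?) is non-greedy: the whole regexp prefers a short
# match, and the greedy digits group cannot lengthen the overall match.
regexp {Y*?([0-9]{1,3})} $s wholeShort digitsShort

puts "longest=|$longest| shortest=|$shortest|"
puts "greedy first:     |$wholeLong| |$digitsLong|"
puts "non-greedy first: |$wholeShort| |$digitsShort|"
```

The last two patterns differ only in the first quantifier, which is enough to change what the later, still greedy, group captures. That is the same effect the original test.itcl runs into.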

>The logic, for those interested, is to look for each flag on the command
>line one at a time and get any and everything for that flag to the next
>valid flag (except leading and trailing white-spaces) to process.

Note that if you ever add a new valid flag, that can break the parsing of
existing command lines. In such applications, it is generally better and
cleaner to define the parsing in terms of a more general rule for what is
and isn't a flag, rather than saying that anything which isn't a valid
flag (today) is a non-flag. (Regular expressions are still cursed with
annoying syntax inconsistencies because of a similar "we'll make parsing
as generous as we can, rather than making it simple and predictable"
decision made many years ago.)

>PS: I could abandon regexp, make a list and process it. This would take
>much more work and be less "clean".

Much though I hate to say it, there are situations where One Big Regexp is
not the right way to do things.
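
One way to read the two suggestions above (a general rule for what counts as a flag, and no single big regexp) is the following minimal Tcl sketch. The proc name parseFlags and the rule "a flag is a dash followed by a letter" are assumptions made for illustration; neither comes from the thread. Checking the returned names against a list of valid flags would be a separate step.

```tcl
# Hypothetical helper, not from the thread. Assumed rule: a word is a
# flag when it is "-" followed by a letter, so "-4" stays an argument.
proc parseFlags {line} {
    set result {}
    set flag ""
    set values {}
    foreach word [regexp -all -inline {\S+} $line] {
        if {[regexp {^-[A-Za-z]} $word]} {
            # A new flag starts: store the previous flag and its arguments.
            if {$flag ne ""} {
                lappend result $flag [join $values " "]
            }
            set flag [string range $word 1 end]
            set values {}
        } elseif {$flag ne ""} {
            # Words before the first flag (e.g. "Plot Z VM") are skipped.
            lappend values $word
        }
    }
    if {$flag ne ""} {
        lappend result $flag [join $values " "]
    }
    return $result
}

puts [parseFlags "Plot Z VM -window 4 -volume 1 -segment 2"]
```

Each regexp here is small and has a single quantifier, so the mixed-greediness question does not arise.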
--
The space program reminds me | Henry Spencer he...@spsystems.net
of a government agency. -Jim Baen | (aka he...@zoo.toronto.edu)

Mark Baldwin

Sep 21, 1999, 3:00:00 AM

Henry Spencer <he...@spsystems.net> wrote in message
news:FIECG...@spsystems.net...

> In article <7s6esh$7jq$1...@bgtnsc01.worldnet.att.net>,
> Mark Baldwin <mark.e....@worldnet.att.net> wrote:
> >...Does anyone know if this is a "bug" or a "feature" and how to
> >get the same answer perl gives?
>
> Unfortunately, the answer is that to get the same answer Perl gives, you
> have to use Perl's exact regexp implementation.

I understand the difficulty of parsing re's, and of trying to implement
all the features of some other system in your own. I'm impressed with re's
usefulness in tcl. I think you've done a good job. I would point out that
I learned re's for use with, I think, egrep, many many years ago. I have
used them on almost every system, tool and language that supports them.
Simple re's are fairly interchangeable between systems. All systems seem to
have their limits and failures though. Still, I think of re's as somewhat
of a standard. It would be nice if the latest round stayed reasonably that
way. Since I regularly program in at least a dozen languages, I could never
keep track of what works in each.

(I'm not exaggerating. If not for Xemacs and its re searching I'd have lost
my mind by now.)

>
> >The logic, for those interested, is to look for each flag on the command
> >line one at a time and get any and everything for that flag to the next
> >valid flag (except leading and trailing white-spaces) to process.
>
> Note that if you ever add a new valid flag, that can break the parsing of
> existing command lines. In such applications, it is generally better and
> cleaner to define the parsing in terms of a more general rule for what is
> and isn't a flag, rather than saying that anything which isn't a valid
> flag (today) is a non-flag. (Regular expressions are still cursed with
> annoying syntax inconsistencies because of a similar "we'll make parsing
> as generous as we can, rather than making it simple and predictable"
> decision made many years ago.)
>

Actually, since adding a flag is done by adding it to a list and that list
controls the parsing in the re's, this approach makes it easy to add,
remove or change flags. You can even add them without writing the
code to support them and then they and their arguments are just ignored.
It took me two days to write this parser. On a previous, non-tcl project I
spent several months writing a slightly more sophisticated command parser.
This one was quick and easy to write, flexible, and, hopefully, extensible by
the user support people. In spite of preferring assembly language to "c" or
Perl I find I love tcl and regular expressions.
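
The list-driven generation described above might look roughly like the following Tcl sketch. The proc names terminator and flagPattern are hypothetical, and the real class is not shown in the thread; the flag names are the ones from the original post. The result has the same shape as the test.itcl pattern (with the short and long spellings listed as separate terminators), so it only shows how the list drives the regexp and it inherits the greediness behavior discussed earlier.

```tcl
# Hypothetical flag list; the names are the ones used in the original post.
set flags {w window volume segment linetype symboltype linecolor}

# Build the "next flag or end of line" group from the list.
proc terminator {flags} {
    set alts {}
    foreach f $flags {
        lappend alts "\\s+-$f\\s"
    }
    lappend alts "\$"
    return "([join $alts |])"
}

# Build the pattern for one flag, given all of its spellings.
proc flagPattern {names flags} {
    return "(?:-[join $names |-])\\s+(.+?)[terminator $flags]"
}

set ss [flagPattern {w window} $flags]
puts $ss
# Prints 1 if the flag is present on the line, 0 otherwise.
puts [regexp -- $ss "Plot Z VM -window 4 -volume 1 -segment 2"]
```

Adding a flag then means appending one name to the list; no pattern text has to be edited by hand.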

Frederic BONNET

Sep 22, 1999, 3:00:00 AM
Hi Mark,

Mark Baldwin wrote:
> PS: I could abandon regexp, make a list and process it. This would take
> much more work and be less "clean". Not to mention that I may want regexp
> to work later.

This is not really germane to the main topic, but maybe you need more lexer-like
features than plain regexps. If your goal is to parse large strings, you should
try my tcLex extension:

http://www.multimania.com/fbonnet/Tcl/tcLex/index.en.htm

It uses Tcl's regexp features to provide a (f)lex-like control structure in Tcl.
It helps a lot when you need conditional matching that would otherwise require
either writing huge regexps or splitting the output.

See you, Fred
--
Frédéric BONNET frederi...@ciril.fr
---------------------------------------------------------------
"Theory may inform, but Practice convinces"
George Bain

