Advanced RegExp Problems

jlblackwell

Nov 20, 2001, 9:45:35 AM

Hello all.

It would appear that my regexp skills have degraded since the last
time I wrote Tcl code. I'm having several problems understanding why
my regexps are behaving the way they are, and I would really
appreciate some help/suggestions. FWIW, I'm using TclPro 1.5.0 (info
patchlevel returns 8.3.2) on Solaris 8. First, let me give you a
sample of the text (in between the --Begin sample--/--End sample--
header/footer) I'm running the regexps against:

--Begin sample--

*RESULTS 1
THESE ARE MEASURED RESULTS 1 OF 4
>--RHO-- --XROT-- --XLAT-- --ROFF-- --RON-- --TEMP-- --TRIM-- --R1R7--
7.0000 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000 0.00000
> L SENS01 SENS02
-2.112824 -1.178956 -1.178956
-2.095875 -1.437103 -0.904959
-1.129755 -1.716004 0.345548
-0.875513 -1.718864 0.534042
0.904181 -1.476384 1.561111
1.344868 -1.357532 1.769885
1.819453 -1.208442 1.981573
2.310988 -1.033087 2.188378
10.412835 2.775668 4.377818
10.904369 2.941182 4.442303
14.819695 3.902359 4.160608
14.836646 4.032926 4.032926
*RESULTS 1
THESE ARE MEASURED RESULTS 2 OF 4
>--RHO-- --XROT-- --XLAT-- --ROFF-- --RON-- --TEMP-- --TRIM-- --R1R7--
7.0000 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000 0.00000
> L SENS01 SENS02
-2.167140 -0.830484 -0.830484
-2.149890 -0.979874 -0.640606
4.387865 0.033392 3.145168
*

--End sample--

Let's say that the above string is stored in samplestr. Assume there is
always a terminating asterisk at the end of the last section.

The first problem:

regexp -lineanchor -- {(?:^\*RESULTS)(.*)(?=^\*)} $samplestr m desiredsub

results in desiredsub containing everything following the first
instance of "*RESULTS" up to the terminating asterisk. Whereas

regexp -lineanchor -- {(?:^\*RESULTS.*?\n)(.*)(?=^\*)} $samplestr m desiredsub

results in desiredsub containing just the first RESULTS section. Why
is it that gobbling the trailing spaces and newline after the
"*RESULTS" header causes the second subexpression to match 1 results
section whereas not doing that causes the second subexpression to
match all the sections?

The next problem:

regexp -lineanchor -- {(?:^\*RESULTS.*?\n)(?:.*)(?:^([[:space:]]+[[:digit:]\-]+.[[:digit:]]+){8}\n)(.+)(?=^\*)} $samplestr m desiredsub

results in desiredsub containing "0.00000". If I take off the ?:
prefix on the first three subexpressions and examine them, they match
the parts of the header that I expect them to. I want the (.+)
subexpression to match all of the data that's in the 3-column format
(including the L/SENS01/SENS02 header), but I don't understand why I'm
getting what I'm getting.

Can someone please shed some light on these things?

Also, I've got a related question. Suppose I wanted to process a file
that had nothing but these RESULTS sections. If I wanted to
sequentially grab each RESULTS section (let's say in a for loop), what
would be the best way to do that? My current method is to use a
regexp to match a section, storing the matched section wherever I
want, then re-running the regexp with the -indices switch, storing the
last index, then feeding that index into the succeeding regexp with
the -start switch. Are there cleaner/better ways to do this? Is
there a way to trim a matching section off the top of the string or
anything like that?

Thanks very much for any help/answers. Please reply to post in NG
itself, do not reply via e-mail.

-John L. Blackwell

Cameron Laird

Nov 20, 2001, 10:07:51 AM

In article <6dcd0ebf.01112...@posting.google.com>,

jlblackwell <jlbla...@freeze.com> wrote:
>Hello all.
>
>It would appear that my regexp skills have degraded since the last
>time I wrote Tcl code. I'm having several problems understanding why
>my regexps are behaving the way they are, and I would really
>appreciate some help/suggestions. FWIW, I'm using TclPro 1.5.0 (info

.
.
.
This would be an excellent time for someone to demonstrate
the utility of the ActiveState "RE debugger".
--

Cameron Laird <Cam...@Lairds.com>
Business: http://www.Phaseit.net
Personal: http://starbase.neosoft.com/~claird/home.html

Jeffrey Hobbs

Nov 20, 2001, 12:31:06 PM

Cameron Laird wrote:
>
> In article <6dcd0ebf.01112...@posting.google.com>,
> jlblackwell <jlbla...@freeze.com> wrote:
> >It would appear that my regexp skills have degraded since the last
> >time I wrote Tcl code. I'm having several problems understanding why
> >my regexps are behaving the way they are, and I would really
> >appreciate some help/suggestions. FWIW, I'm using TclPro 1.5.0 (info
...
> This would be an excellent time for someone to demonstrate
> the utility of the ActiveState "RE debugger".

For those who aren't familiar with that, Cameron is referring to
the RX, the RE analysis tool in the Komodo IDE. However, as
it is a graphical analysis tool, it would be hard to "demonstrate"
it on the newsgroup. It's kind of something you have to try for
yourself (ASPN Tcl is in beta, so you can get a trial license for
Komodo at http://www.activestate.com/Products/ASPN_Tcl/).

--
Jeff Hobbs The Tcl Guy
Senior Developer http://www.ActiveState.com/
Tcl Support and Productivity Solutions

jlblackwell

Nov 20, 2001, 8:35:44 PM

Jeffrey Hobbs <Je...@ActiveState.com> wrote in message news:<3BFA945C...@ActiveState.com>...

> For those who aren't familiar with that, Cameron is referring to
> the RX, the RE analysis tool in the Komodo IDE. However, as
> it is a graphical analysis tool, it would be hard to "demonstrate"
> it on the newsgroup. It's kind of something you have to try for
> yourself (ASPN Tcl is in beta, so you can get a trial license for
> Komodo at http://www.activestate.com/Products/ASPN_Tcl/).

Jeff and Cameron, thanks very much for the heads-up. I will try to
check that tool out.

I would still really appreciate it if someone could explain one or the
other (though both would be nice :) ) of the results I'm getting on
those regexps. I spent a while thinking about it today, and I still
fail to realize why they do what they do.

Anyone?

-John

jlblackwell

Nov 21, 2001, 8:31:16 AM

Jeffrey Hobbs <Je...@ActiveState.com> wrote in message news:<3BFA945C...@ActiveState.com>...

> For those who aren't familiar with that, Cameron is referring to
> the RX, the RE analysis tool in the Komodo IDE. However, as

There is another problem in that Komodo is not available for
Solaris/SPARC, and the only kind of machine that I currently have
access to is a Sun Ultra 60. Are there any plans to support more
platforms with Komodo?

-John

Jeffrey Hobbs

Nov 21, 2001, 8:36:02 AM

*Real* soon now, you'll see an updated release for Linux, and
Solaris is next on the platform list, but not with a certain
release date as yet.

Michael A. Cleverly

Nov 24, 2001, 10:06:30 PM

It's not so much the gobbling the trailing spaces and newline after the
"*RESULTS" header per se that causes the problem, but the fact that you
are using a non-greedy quantifier to do so. Compare the output from:

% regexp -lineanchor -about -- $RE1 $samplestr
1 {REG_ULOOKAHEAD REG_UNONPOSIX}

and

% regexp -lineanchor -about -- $RE2 $samplestr
1 {REG_ULOOKAHEAD REG_UNONPOSIX REG_USHORTEST}

It's not necessarily intuitive (is anything with REs ever? ;-) and it
isn't documented explicitly & clearly in the re_syntax man page, but
greedy vs non-greedy is an all or nothing thing; by default all
expressions are greedy, but a non-greedy one switches all to being
non-greedy.
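
A minimal sketch of that preference switch, assuming a made-up test
string (not from the original data). As far as the MATCHING section of
re_syntax describes it, the whole RE takes its preference from the first
quantifier that has one, so swapping which quantifier comes first should
flip the overall result:

```tcl
# Hypothetical test string, chosen only to show the preference rule.
set s "aXbYbZ"

# Non-greedy quantifier comes first: the whole RE prefers the shortest
# match, so the trailing greedy .* ends up matching nothing.
set short [regexp -inline -- {a.*?b.*} $s]

# Greedy quantifier comes first: the whole RE prefers the longest match,
# so the trailing non-greedy .*? is stretched to the end of the string.
set long [regexp -inline -- {a.*b.*?} $s]

puts "short: $short"
puts "long:  $long"
```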

For a good explanation of how come (and why it isn't necessarily a bug),
see this September 1999 comp.lang.tcl posting from Henry Spencer himself:

http://groups.google.com/groups?hl=en&selm=FIECG4.F75%40spsystems.net

Also, from Jeffrey Friedl's "Mastering Regular Expressions" (O'Reilly,
1997, 1-56592-257-3) pages 226-27:

A non-greedy construct vs. a negated character class

I find that people often use a non-greedy construct as an easily
typed replacement for a negated character-class, such as using
<(.+?)> instead of <([^>]+)>. Sometimes this type of replacement
works, although it is less efficient -- the implied loop of the star
or plus must keep stopping to see whether the rest of the regex can
match. Particularly in this example, it involves temporarily leaving
the parenthesis which, as Chapter 5 points out, has its own
performance penalty. Even though the non-greedy constructs are
easier to type and perhaps easier to read, make no mistake:
<i>what they match can be very different</i>.

[ ... a help example omitted ...]

The non-greedy constructs are without a doubt the most powerful
Perl5 additions to the regex flavor, but you must use them with
care. A non-greedy .*? is almost never a reasonable substitute
for [^...]* -- one might be proper for a particular situation, but
due to their vastly different meaning, the other is likely incorrect.

Your first RE matches beginning @ pos 10 & ending @ pos 1021. The second
matches from pos 18 to 650. Using the following RE you can match from pos
18 to 1021 instead:

{(?:^\*RESULTS[^\n]*\n)(.*)(?=^\*)}
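
To see the difference on something small, here is a sketch that assumes
a cut-down, made-up sample with two sections (the variable names first
and all are only for illustration):

```tcl
# Made-up sample: two sections and a terminating asterisk.
set samplestr "*RESULTS 1\nA 1\nB 2\n*RESULTS 2\nC 3\n*\n"

# Non-greedy header: the whole RE prefers the shortest match, so (.*)
# stops before the first following line that starts with an asterisk.
regexp -lineanchor -- {(?:^\*RESULTS.*?\n)(.*)(?=^\*)} $samplestr m first

# Negated class for the header: everything stays greedy, so (.*) runs up
# to the last line that starts with an asterisk.
regexp -lineanchor -- {(?:^\*RESULTS[^\n]*\n)(.*)(?=^\*)} $samplestr m all

puts "first: [string length $first] chars"
puts "all:   [string length $all] chars"
```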

> The next problem:
>
> regexp -lineanchor -- {(?:^\*RESULTS.*?\n)(?:.*)(?:^([[:space:]]+[[:digit:]\-]+.[[:digit:]]+){8}\n)(.+)(?=^\*)} $samplestr m desiredsub
>
> results in desiredsub containing "0.00000". If I take off the ?:
> prefix on the first three subexpressions and examine them, they match
> the parts of the header that I expect them to. I want the (.+)
> subexpression to match all of the data that's in the 3-column format
> (including the L/SENS01/SENS02 header), but I don't understand why I'm
> getting what I'm getting.
>
> Can someone please shed some light on these things?

Again, same issue: (greedy & non-greedy quantifiers intermixed). Also your
RE above would match "numbers" like -3-432Z432 ... you probably want
[[:space:]]+-?[[:digit:]]+\.[[:digit:]]+ instead. (Personally I like
\s+-?\d+\.\d+ myself since it's shorter :-).
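
A quick check of that claim, as a sketch: the ^ and $ anchors and the
variable names are added here only so that each pattern has to cover the
whole test string.

```tcl
# One field of the original RE versus the tightened version.
set loose  {^[[:space:]]+[[:digit:]\-]+.[[:digit:]]+$}
set strict {^\s+-?\d+\.\d+$}

set bogus " -3-432Z432"
set real  " -2.112824"

puts [regexp -- $loose  $bogus]   ;# the loose pattern accepts the bogus "number"
puts [regexp -- $strict $bogus]   ;# the strict pattern rejects it
puts [regexp -- $strict $real]    ;# a real value still matches
```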

> Also, I've got a related question. Suppose I wanted to process a file
> that had nothing but these RESULTS sections. If I wanted to
> sequentially grab each RESULTS section (let's say in a for loop), what
> would be the best way to do that? My current method is to use a
> regexp to match a section, storing the matched section wherever I
> want, then re-running the regexp with the -indices switch, storing the
> last index, then feeding that index into the succeeding regexp with
> the -start switch. Are there cleaner/better ways to do this? Is
> there a way to trim a matching section off the top of the string or
> anything like that?

This is what I would use:

set RE {
    \*RESULTS[^\n]*\n               # header line
    [^\n]*\n                        # these are x of y
    [^\n]*\n                        # subheader of 8 columns
    ((?:\s+-?\d+\.\d+){8})\n        # capture 8 values (a valid tcl list!)
    [^\n]*\n                        # subheader of 3 columns
    ((?:(?:\s*-?\d+\.\d+){3}\s*)+)  # capture 1 or more lines of 3 values each
                                    # (which also is itself a valid tcl list!)
    (?=\n\*)                        # look ahead and make sure there is an
                                    # asterisk at the start of the next line
}

# assume all the results are in a string named results
# perhaps read straight from a file via fileutil::cat ...

foreach {match vals data} [regexp -inline -expanded -all $RE $results] {
    foreach {rho xrot xlat roff ron temp trim r1r7} $vals break
    foreach {l sens01 sens02} $data {

        # ... whatever it is you do for each entry goes here

    }
}
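
For anyone who wants to try it without the real file, here is a
self-contained sketch. The input is a made-up, cut-down version of the
sample, and it assumes the value lines are indented by at least one
space (the \s+ in front of the first of the 8 values needs that). The
summary list is only there for illustration.

```tcl
# Same RE as in the post, repeated so this snippet stands alone.
set RE {
    \*RESULTS[^\n]*\n               # header line
    [^\n]*\n                        # "x of y" line
    [^\n]*\n                        # subheader of 8 columns
    ((?:\s+-?\d+\.\d+){8})\n        # 8 values
    [^\n]*\n                        # subheader of 3 columns
    ((?:(?:\s*-?\d+\.\d+){3}\s*)+)  # rows of 3 values
    (?=\n\*)                        # next line starts with an asterisk
}

# Made-up input: two sections, then the terminating asterisk.
set results [join {
    {*RESULTS 1}
    {THESE ARE MEASURED RESULTS 1 OF 2}
    {>--RHO-- --XROT-- --XLAT-- --ROFF-- --RON-- --TEMP-- --TRIM-- --R1R7--}
    { 7.0000 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000 0.00000}
    {> L SENS01 SENS02}
    { -2.112824 -1.178956 -1.178956}
    { -2.095875 -1.437103 -0.904959}
    {*RESULTS 2}
    {THESE ARE MEASURED RESULTS 2 OF 2}
    {>--RHO-- --XROT-- --XLAT-- --ROFF-- --RON-- --TEMP-- --TRIM-- --R1R7--}
    { 7.0000 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000 0.00000}
    {> L SENS01 SENS02}
    { 4.387865 0.033392 3.145168}
    *
} \n]

# Record, per section, how many header values and data values were captured.
set summary {}
foreach {match vals data} [regexp -inline -expanded -all -- $RE $results] {
    lappend summary [list [llength $vals] [llength $data]]
}
puts $summary
```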

Does that do what you need it to?

Michael

jlblackwell

Nov 26, 2001, 10:43:34 AM

"Michael A. Cleverly" <mic...@cleverly.com> wrote in message news:<Pine.LNX.4.33.011124...@gibraltar.deseretbook.net>...

> It's not so much the gobbling the trailing spaces and newline after the
> "*RESULTS" header per se that causes the problem, but the fact that you
> are using a non-greedy quantifier to do so. Compare the output from:

Wow! That's a *very* serious
implication/side-effect/whatever-you-want-to-call it! I guess when
you think about the algorithms being used to do the matching and read
that post by Henry Spencer, it makes sense, but at the outset it's
pretty darn unintuitive.

I'm really surprised this isn't explicitly detailed in the manpages,
or even what I would call the Tcl "Bible," Brent Welch's "Practical
Programming in Tcl and Tk". I've been using the 3rd ed. as my
reference, and I don't think there's anything in the regexp section of
that book which would have hinted at the greedy/non-greedy mixing
problem or helped me understand/foresee it (which is not to "diss" Mr.
Welch's book, as I think it's very good overall).

> Again, same issue: (greedy & non-greedy quantifiers intermixed). Also your
> RE above would match "numbers" like -3-432Z432 ... you probably want
> [[:space:]]+-?[[:digit:]]+\.[[:digit:]]+ instead. (Personally I like

Yes, you're quite right.

> foreach {match vals data} [regexp -inline -expanded -all $RE $results] {
>     foreach {rho xrot xlat roff ron temp trim r1r7} $vals break
>     foreach {l sens01 sens02} $data {
>
>         # ... whatever it is you do for each entry goes here
>
>     }
> }
>
>
> Does that do what you need it to?

Yes, excellent! This is the type of solution I was looking for -- the
start index thing seemed to be ugly and cumbersome. I'd never used
the -inline switch to regexp before. It's also not in Welch's book
(which is probably why I never thought to use it) so I'm guessing it
was added to Tcl after his book was written.

Well, thank you very, very much Michael. I really appreciate your
detailed answers to all my questions.

-John

Arjen Markus

Nov 26, 2001, 11:14:03 AM

jlblackwell wrote:
>
> "Michael A. Cleverly" <mic...@cleverly.com> wrote in message news:<Pine.LNX.4.33.011124...@gibraltar.deseretbook.net>...
>
> > It's not so much the gobbling the trailing spaces and newline after the
> > "*RESULTS" header per se that causes the problem, but the fact that you
> > are using a non-greedy quantifier to do so. Compare the output from:
>
> Wow! That's a *very* serious
> implication/side-effect/whatever-you-want-to-call it! I guess when
> you think about the algorithms being used to do the matching and read
> that post by Henry Spencer, it makes sense, but at the outset it's
> pretty darn unintuitive.
>
> I'm really surprised this isn't explicitly detailed in the manpages,
> or even what I would call the Tcl "Bible," Brent Welch's "Practical
> Programming in Tcl and Tk". I've been using the 3rd ed. as my
> reference, and I don't think there's anything in the regexp section of
> that book which would have hinted at the greedy/non-greedy mixing
> problem or helped me understand/foresee it (which is not to "diss" Mr.
> Welch's book, as I think it's very good overall).
>

If I remember correctly, Brent Welch does explain the use of greedy
and non-greedy operators, but not at great length. He does have a great
deal to say about them and the chapter is already rather lengthy ;-)

Well, I have learned the flag -inline from this, which may be the
solution to some of my own problems with [regexp/regsub].

But I have one additional question, I think the answer is No, but
then again, regular expressions are a big subject:

If I have a string like "HHHHEEEEAAAADDDDEEEERRRR" (in fact any valid
character might be repeated four times), can I formulate a regular
expression that can be used in [regsub] to replace each fourfold
character to its singular?

So, given the above string, [regsub "$magic_re" $string] --> "HEADER"?

And this with any character (the repetition count is fixed).

Regards,

Arjen

Frank Pilhofer

Nov 26, 2001, 12:03:42 PM

Arjen Markus <Arjen....@wldelft.nl> wrote:
>
> But I have one additional question, I think the answer is No, but
> then again, regular expressions are a big subject:
>
> If I have a string like "HHHHEEEEAAAADDDDEEEERRRR" (in fact any valid
> character might be repeated four times), can I formulate a regular
> expression that can be used in [regsub] to replace each fourfold
> character to its singular?
>
> So, given the above string, [regsub "$magic_re" $string] --> "HEADER"?
>

What about
regsub -all {(.){4}} $string {\1} result

Frank.

--
Frank Pilhofer ........................................... f...@fpx.de
Life would be a very great deal less weird without you. -Douglas Adams

Michael A. Cleverly

Nov 26, 2001, 12:06:50 PM

to Arjen Markus

On Mon, 26 Nov 2001, Arjen Markus wrote:

> But I have one additional question, I think the answer is No, but
> then again, regular expressions are a big subject:
>
> If I have a string like "HHHHEEEEAAAADDDDEEEERRRR" (in fact any valid
> character might be repeated four times), can I formulate a regular
> expression that can be used in [regsub] to replace each fourfold
> character to its singular?
>
> So, given the above string, [regsub "$magic_re" $string] --> "HEADER"?
>
> And this with any character (the repetition count is fixed).

Like this?:

% set foo "HHHHEEEEAAAADDDDEEEERRRR"
HHHHEEEEAAAADDDDEEEERRRR
% regsub -all -- {(.)\1\1\1} $foo {\1} bar
6
% set bar
HEADER

Michael

Bruce Hartweg

Nov 26, 2001, 12:17:46 PM

"Frank Pilhofer" <f...@fpx.de> wrote in message news:slrna04s...@rose.fpx.de...

> Arjen Markus <Arjen....@wldelft.nl> wrote:
> >
> > But I have one additional question, I think the answer is No, but
> > then again, regular expressions are a big subject:
> >
> > If I have a string like "HHHHEEEEAAAADDDDEEEERRRR" (in fact any valid
> > character might be repeated four times), can I formulate a regular
> > expression that can be used in [regsub] to replace each fourfold
> > character to its singular?
> >
> > So, given the above string, [regsub "$magic_re" $string] --> "HEADER"?
> >
>
> What about
> regsub -all {(.){4}} $string {\1} result
>

That's a bit dangerous - it works for the example but
if the original string was already HEADER - you will make it DER
(i.e. you replace every 4 chars with the last char of the group - regardless of if they were repeats)

better to use a backref

regsub -all {(.)\1{3}} $string {\1} result

for exactly 4 char repeats ,

and {(.)\1+} for arbitrary repeats
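
As a small sketch of the difference (the "aaabccdd" string and the
variable names are made up for illustration):

```tcl
set re {(.)\1{3}}

# Every character repeated four times: each run collapses to one character.
regsub -all -- $re "HHHHEEEEAAAADDDDEEEERRRR" {\1} collapsed

# No repeats at all: the backreference version leaves the string alone ...
regsub -all -- $re "HEADER" {\1} untouched

# ... while (.){4} swallows any four characters and keeps only the last.
regsub -all -- {(.){4}} "HEADER" {\1} mangled

# Arbitrary run lengths, using \1+ instead of a fixed count.
regsub -all -- {(.)\1+} "aaabccdd" {\1} squeezed

puts "$collapsed $untouched $mangled $squeezed"
```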

Bruce

Glenn Jackman

Nov 26, 2001, 12:13:29 PM

On [Mon, 26 Nov 2001 17:14:03 +0100],

Arjen Markus <Arjen....@wldelft.nl> wrote:
> If I have a string like "HHHHEEEEAAAADDDDEEEERRRR" (in fact any valid
> character might be repeated four times), can I formulate a regular
> expression that can be used in [regsub] to replace each fourfold
> character to its singular?
>
> So, given the above string, [regsub "$magic_re" $string] --> "HEADER"?
>
>And this with any character (the repetition count is fixed).

set string HHHHEEEEAAAADDDDEEEERRRR
set rep 4
set magic_re "(.){$rep}"
regsub -all -- $magic_re $string {\1} new

Assumes: every single character in $string is repeated exactly
$rep times.

An interesting experiment: what would you guess the result of
regsub -all -- {(.){4}} "1234567890ab" {\1} answer; set answer
would be? I had expected "159".

--
Glenn Jackman

Michael A. Cleverly

Nov 26, 2001, 12:40:27 PM

On 26 Nov 2001, Glenn Jackman wrote:

> An interesting experiment: what would you guess the result of
> regsub -all -- {(.){4}} "1234567890ab" {\1} answer; set answer
> would be? I had expected "159".

48b

as opposed to:

regsub -all -- {(.)...} "1234567890ab" {\1} answer; set answer

which gives the expected 159.

The parentheses in (.){4} capture one character, essentially four times;
the fourth character, then, is the one left "captured" (that may not be
exactly how it works under the hood, but that's how I think of it anyway).
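
The two variants side by side, as a runnable sketch (the variable names
last and first are only for illustration):

```tcl
set digits "1234567890ab"

# (.){4}: the group is re-captured on every repetition, so \1 ends up
# holding the 4th, 8th and 12th characters.
set n1 [regsub -all -- {(.){4}} $digits {\1} last]

# (.)...: the group captures once, then three more characters are consumed.
set n2 [regsub -all -- {(.)...} $digits {\1} first]

puts "$n1 replacements -> $last"
puts "$n2 replacements -> $first"
```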

Michael

Arjen Markus

Nov 27, 2001, 2:32:02 AM

Both of you, thanks. I never realised that (.) would create a
placeholder to match a sequence of the same characters.

The problem initially arose when I caught the output of "man some
command" and had to get rid of the overstrikes that create the bold
text. Remembering some remarks about the limitations of regular
expressions I thought it could NOT be solved that way, but you have
proven otherwise.

Regards,

Arjen

Arjen Markus

Nov 27, 2001, 3:09:19 AM

Arjen Markus wrote:
>
> "Michael A. Cleverly" wrote:
> >
> > On Mon, 26 Nov 2001, Arjen Markus wrote:
> >
> > > But I have one additional question, I think the answer is No, but
> > > then again, regular expressions are a big subject:
> > >
> > > If I have a string like "HHHHEEEEAAAADDDDEEEERRRR" (in fact any valid
> > > character might be repeated four times), can I formulate a regular
> > > expression that can be used in [regsub] to replace each fourfold
> > > character to its singular?
> > >
> > > So, given the above string, [regsub "$magic_re" $string] --> "HEADER"?
> > >
> > > And this with any character (the repetition count is fixed).
> >
> > Like this?:
> >
> > % set foo "HHHHEEEEAAAADDDDEEEERRRR"
> >

> > % regsub -all -- {(.)\1\1\1} $foo {\1} bar
> > 6
> > % set bar
> > HEADER
> >
> > Michael
>
> Both of you, thanks. I never realised that (.) would create a
> placeholder to match a sequence of the same characters.
>
> The problem initially arose when I caught the output of "man some
> command" and had to get rid of the overstrikes that create the bold
> text. Remembering some remarks about the limitations of regular
> expressions I thought it could NOT be solved that way, but you have
> proven otherwise.
>
> Regards,
>
> Arjen

I did some experimenting with other strings, like
"just a HHHHEEEEAAAADDDDEEEERRRR". The regular expression {(.)\1\1\1}
does the job I would have wanted, whereas {(.){4}} will return the
last of each four characters - as posted as well.

Regards,

Arjen

Glenn Jackman

Nov 27, 2001, 10:16:47 AM

On [Tue, 27 Nov 2001 09:09:19 +0100],

Arjen Markus <Arjen....@wldelft.nl> wrote:
>I did some experimenting with other strings, like
>"just a HHHHEEEEAAAADDDDEEEERRRR". The regular expression {(.)\1\1\1}
>does the job I would have wanted, whereas {(.){4}} will return the
>last of each four characters - as posted as well.

That surprised me too -- being able to place backreferences within
the regex is an extremely powerful technique.

--
Glenn Jackman

jlblackwell

Nov 28, 2001, 11:35:45 AM

Thanks for the previous answers, but it looks like I've run into some
more questions, one of them involving greedy stuff again :) .

First, a new, more detailed sample:

--Begin sample--

# blah
blah blah blah blah
*RESULTS 1 1
Large fixture
# IQ JZ JL JD K2 BLRR KITE MLRR GLVL ITN KQW KSI IFF
11 02 99 20 03 150 1 1 0 0 0
#--XBENDL- ----YY--- ----II--- -FUNC(X)- -FUNC(Y)- -FUNC(W)-
---LINE--XURIMXDBOD
21.9755 10.000 47.000 1.0 1.0 1.0 0.0
0.0 0.00
*SUBSECT 1

THESE ARE MEASURED RESULTS 1 OF 4

#----RHO-- ---XROT-- ---XLAT-- ---ROFF-- ---RON--- ---TEMP-- ---TRIM--
---R1R7--
7.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000
# L SENS01 SENS02

-2.112824 -1.178956 -1.178956
-2.095875 -1.437103 -0.904959
-1.129755 -1.716004 0.345548
-0.875513 -1.718864 0.534042
0.904181 -1.476384 1.561111
1.344868 -1.357532 1.769885
1.819453 -1.208442 1.981573
2.310988 -1.033087 2.188378
10.412835 2.775668 4.377818
10.904369 2.941182 4.442303
14.819695 3.902359 4.160608
14.836646 4.032926 4.032926

#RESULTS 2 1
Large fixture
# IQ JZ JL JD K2 BLRR KITE MLRR GLVL ITN KQW KSI IFF
11 02 99 20 03 150 1 1 0 0 0
#--XBENDL- ----YY--- ----II--- -FUNC(X)- -FUNC(Y)- -FUNC(W)-
---LINE--XURIMXDBOD
97.4700 15.000 47.0000 1.0 1.0 1.0 0.0
0.0 0.00
#SUBSECT 1
THESE ARE MEASURED RESULTS 1 OF 1
#----RHO-- ---XROT-- ---XLAT-- ---ROFF-- ---RON--- ---TEMP-- ---TRIM--
---R1R7--
7.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000
# L SENS01 SENS02
5.096404 4.458897 4.458897
5.105601 4.425416 4.496124
5.123996 4.404600 4.524673
5.151587 4.388152 4.552802
5.197573 4.374877 4.585586
#ENDCOND
10.00000 01.84961 10.00000 19.04772
10.00000 01.84961 10.00000 19.04772
10.00000 01.84961 10.00000 19.04772
10.00000 01.84961 10.00000 19.04772
13.00000 11.60510 12.72057 19.07276
-3.10308 11.59668 13.00000 19.12450
13.50000 11.50000 14.88300 19.47281

--End sample--

Before I go into the regexps, let me first say that I'm not looking
for advice on how to change my algorithm for grabbing desired lines
(e.g. no "why don't you just use a loop that..." posts please), I'm
just trying to get help on figuring out why my patterns are doing what
they're doing.

The first problem:

If I try to match lines following the *RESULTS line like this:

regexp -line -inline -- {(?:^\*RESULTS.*\n)(?:(?:^.*\n)+?)} $samplestr

the second subexpression matches all the lines following the line
containing *RESULTS. Shouldn't the non-greedy quantifier cause the
second subexpr to just match one line? I observed that if I changed
the first subexpr to (?:^\*RESULTS.*?\n) the second subexpr worked the
way I thought it ought to, but I don't understand why since regexp is
in -line mode and .* shouldn't match across a \n . Not to mention
that if I take off the ?: qualifier, regardless of whether I use
RESULTS.* or RESULTS.*? the first subexpr returns the same thing.

Now, the question may arise, why try to match one line with the +?
quantifier? The reason is that I later want to add another
subexpression which is a look-ahead. This leads to my next question.
The full regexp I'm trying to use is this:

regexp -line -inline -- {(?:^\*RESULTS.*\n)(?:(?:^.*\n)+?)(?=^\*[^S][^U][^B][^S][^E][^C][^T])} $samplestr

which should result in matching all the lines starting with *RESULTS
and stopping at any line beginning with * but having a title (title =
sequence following *) other than SUBSECT. Chaining a positive and
negative look-ahead didn't seem to work. Now this lookahead subexpr
works like I want it to, but is there another way to do what I want
without explicitly complementing each character? I should also
mention that in this second regexp, the (?:(?:^.*\n)+?) section still
matches more lines than I would expect it to, which is why I reduced
the regexp to the first one I gave.

Thanks in advance,
-John

Bruce Hartweg

Nov 28, 2001, 6:51:12 PM

jlblackwell wrote:

> Thanks for the previous answers, but it looks like I've run into some
> more questions, one of them involving greedy stuff again :) .
>

> Before I go into the regexps, let me first say that I'm not looking

> for advice on how to change my algorithm for grabbing desired lines
> (e.g. no "why don't you just use a loop that..." posts please), I'm
> just trying to get help on figuring out why my patterns are doing what
> they're doing.
>
> The first problem:
>
> If I try to match lines following the *RESULTS line like this:
>
> regexp -line -inline -- {(?:^\*RESULTS.*\n)(?:(?:^.*\n)+?)} $samplestr
>
> the second subexpression matches all the lines following the line
> containing *RESULTS. Shouldn't the non-greedy quantifier cause the
> second subexpr to just match one line? I observed that if I changed
> the first subexpr to (?:^\*RESULTS.*?\n) the second subexpr worked the
> way I thought it ought to, but I don't understand why since regexp is
> in -line mode and .* shouldn't match across a \n . Not to mention
> that if I take off the ?: qualifier, regardless of whether I use
> RESULTS.* or RESULTS.*? the first subexpr returns the same thing.

OK, second point first: the subexpr is the same because the -line option keeps the
.* from going past a newline in either case. Basically it says "give me the (longest/shortest)
match that is an entire line", which is the same thing. BUT even though it matches the same, it still
has a preference for the longest match, and therefore the whole RE has a longest-match preference,
which is why the following subexpr matches more than you think it should: even though
that subexpr wants shortest, the RE in general prefers longest, so you get the longest match
possible. So making the first subexpr non-greedy doesn't change the match of that subexpr,
but it DOES change the preference of the whole RE to shortest, so you get what you want.
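
A cut-down sketch of this, assuming a made-up four-line sample (the
variable names are only for illustration):

```tcl
set samplestr "*RESULTS 1\nline A\nline B\nline C\n"

# Greedy .* in the first subexpr: the whole RE prefers longest, so the
# +? group is stretched over every remaining line.
set greedyFirst [lindex [regexp -line -inline -- \
    {(?:^\*RESULTS.*\n)(?:(?:^.*\n)+?)} $samplestr] 0]

# Non-greedy .*? in the first subexpr: the whole RE prefers shortest, so
# the +? group settles for a single line.
set lazyFirst [lindex [regexp -line -inline -- \
    {(?:^\*RESULTS.*?\n)(?:(?:^.*\n)+?)} $samplestr] 0]

puts [llength [split [string trimright $greedyFirst \n] \n]]   ;# lines matched
puts [llength [split [string trimright $lazyFirst \n] \n]]
```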

>
>
> Now, the question may arise, why try to match one line with the +?
> quantifier? The reason is that I later want to add another
> subexpression which is a look-ahead. This leads to my next question.
> The full regexp I'm trying to use is this:
>
> regexp -line -inline -- {(?:^\*RESULTS.*\n)(?:(?:^.*\n)+?)(?=^\*[^S][^U][^B][^S][^E][^C][^T])} $samplestr

> which should result in matching all the lines starting with *RESULTS

> and stopping at any line beginning with * but having a title (title =

> sequence following *) other than SUBSECT. Chaining a positive and
> negative look-ahead didn't seem to work. Now this lookahead subexpr
> works like I want it to, but is there another way to do what I want
> without explicitly complementing each character? I should also
> mention that in this second regexp, the (?:(?:^.*\n)+?) section still
> matches more lines than I would expect it to, which is why I reduced
> the regexp to the first one I gave.

This won't work as written. If you change the first subexpr to non-greedy (as
explained above) it might work for this example, but it would also not stop
at a line starting with "*SOMETHING ELSE ENTIRELY" because you are
negating each letter individually with no relation to the others, so anything starting
with *S (or a U as 2nd char, etc) will also fail your lookahead constraint.

Instead of a non-greedy RE with a positive lookahead constraint to make it
as long as possible (with the constraint in mind), it would be better to specifically
match, greedily, anything that is NOT the start of a next section. In your case you
want all lines that don't start with a * or else start with *SUBSECT:

regexp -line -inline -- {^\*RESULTS.*\n(?:^(?:[^*]|\*SUBSECT).*\n)+} $samplestr

Note that since the ^\*RESULTS part doesn't need parens for anything in the RE,
and you didn't want to capture them anyway, it is easier to just leave them off.
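
Here is a sketch of that RE run against a made-up, cut-down sample. The
sample deliberately has no blank lines: with -line, a bracket expression
like [^*] will not match a newline, so an empty line would end the run
of matched lines.

```tcl
# Made-up sample: a section with a *SUBSECT line, followed by another
# asterisk line that should end the match.
set samplestr "# blah\n*RESULTS 1 1\nLarge fixture\n*SUBSECT 1\n-2.1 -1.1 -1.1\n*ENDCOND\n10.0 1.8\n"

set re {^\*RESULTS.*\n(?:^(?:[^*]|\*SUBSECT).*\n)+}
set section [lindex [regexp -line -inline -- $re $samplestr] 0]

# The match should run from the *RESULTS line through the data line and
# stop in front of *ENDCOND.
puts $section
```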

Bruce

jlblackwell

Nov 28, 2001, 11:20:42 PM

jlbla...@freeze.com (jlblackwell) wrote in message news:<6dcd0ebf.01112...@posting.google.com>...

> The full regexp I'm trying to use is this:
>
> regexp -line -inline -- {(?:^\*RESULTS.*\n)(?:(?:^.*\n)+?)(?=^\*[^S][^U][^B][^S][^E][^C][^T])} $samplestr

I finally got my hands on a P-166 laptop with 128 MB RAM and '95, and
installed the AS Komodo 1.2 beta. Aside from Komodo running dog-slow,
when I use that regexp in Komodo's Rx toolkit with just multi-line
enabled, matching against the sample I gave, I get exactly the results
I expect. Is there some Tcl-specific detail I'm missing?

Thanks,
-John

jlblackwell

Nov 29, 2001, 10:12:15 AM

Bruce Hartweg <brha...@bigfoot.com> wrote in message news:<3C057870...@bigfoot.com>...

> match that is an entire line - same thing. BUT even though it matches the same it still
> has a preference of longest match, and therefore the whole RE has a longest match preference
> which is why the following subexpr matches more than you think it should - even though
> the subexpr wants shortest, the RE in general prefers longest so you get a longest match
> possible. So making the first subexpr non-greedy doesn't change the match of the subexpr,
> but DOES change the preference of the RE to shortest so you get what you want

Is this yet another manifestation of combined greedy/non-greedy
behavior that Michael Cleverly detailed in response to my original
post? If so, what sets the total greedy/non-greedy preference, the
first quantifier?

> This won't work as written. if you change the first subexpr to non-greedy (as
> explained above) it might work for this example, but it would also not stop
> at a line starting with "*SOMETHING ELSE ENTIRELY" because you are
> negating each letter individually with no relation to the others, so anything starting
> with *S (or a U as 2nd char, etc) will also fail your lookahead constraint.

Is there any way to negate a whole clause like I had intended?

After all this I feel a bit humbled and also like I'm in over my head
somewhat relying only upon Brent Welch's regexp section in the book as
my regexp tutorial/reference. I presume this is a good opportunity to
learn by reading "Mastering Regular Expressions"? How well does that
book apply to Tcl, knowing that there are some semantic differences
between, say Perl and Tcl? I know very little to no Perl and I'd
prefer not to have to pick up yet another language to understand
what's written in MRE.

Thanks,
-John

Bruce Hartweg

Nov 29, 2001, 1:32:46 PM

"jlblackwell" <jlbla...@freeze.com> wrote in message news:6dcd0ebf.01112...@posting.google.com...

> Bruce Hartweg <brha...@bigfoot.com> wrote in message news:<3C057870...@bigfoot.com>...

> Is this yet another manifestation of combined greedy/non-greedy
> behavior that Michael Cleverly detailed in response to my original
> post? If so, what sets the total greedy/non-greedy preference, the
> first quantifier?
>

Yes, it is the greedy vs. non-greedy issue. On a long multi-part RE it is
non-trivial to determine what the overall effect is, and small
differences (adding or removing parens, for example) that don't
necessarily change the basic matching of the RE can still affect
the overall greediness. See the MATCHING section of the re_syntax
man page, which goes through the rules of preference for longest/shortest
match for an RE (determining how those rules apply in a specific case is
not always easy, though).

http://tcl.activestate.com/man/tcl8.4/TclCmd/re_syntax.htm
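
A minimal sketch of those preference rules, using a made-up string
rather than the sample data. The first two calls use a lone quantifier.
The third mixes a non-greedy quantifier with a greedy one; as I read the
MATCHING rules, a branch takes its preference from its first quantified
atom, so I would expect the whole match to come out shortest there, but
run it rather than trusting my reading:

```tcl
# Toy input, not taken from the original sample.
set s "aaabbb"

# A single greedy quantifier takes as many characters as it can.
set greedy [regexp -inline {a+} $s]

# A single non-greedy quantifier takes as few as it can.
set lazy [regexp -inline {a+?} $s]

# Mixed case: non-greedy quantifier first, greedy quantifier second.
# The preference of the RE as a whole decides how much is matched.
set mixed [regexp -inline {a+?b*} $s]

puts "greedy: $greedy"
puts "lazy:   $lazy"
puts "mixed:  $mixed"
```

Swapping the order of the two quantifiers, or making both non-greedy, is
a quick way to see how the overall preference shifts.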

>
> Is there any way to negate a whole clause like I had intended?
>

Yes, to match something that is NOT a given string you use alternation
with a negative lookahead constraint. So to match anything but foo,
use {([^f]|f(?!oo))*}. This can be expanded to handle more than simple words
as long as you can express your inverse match in the negative lookahead
(meaning it can't have backrefs). Thanks to Donal for first pointing this out
on a similar RE thread a while back (it can be seen at the end of the Regular
Expression Examples Wiki page, http://mini.net/tcl/989.html).
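
Here is a rough sketch of the difference, checked one line at a time to
keep greediness out of the picture. It assumes the marker you want to
skip is the literal "*SUBSECT" (my guess from the [^S][^U]... attempt;
the posted sample doesn't contain one). The proc names are made up for
illustration.

```tcl
# Per-character negation, as in the original attempt: each position is
# tested on its own, so any line with S right after the "*" is rejected.
proc stops_charclass {line} {
    return [regexp {^\*[^S][^U][^B][^S][^E][^C][^T]} $line]
}

# Negative lookahead: a "*" at the start of the line that is not
# followed by the whole word SUBSECT.
proc stops_lookahead {line} {
    return [regexp {^\*(?!SUBSECT)} $line]
}

# The alternation idiom over a whole string: every character is either
# not "f", or an "f" that does not begin "foo".
proc lacks_foo {str} {
    return [regexp {^(?:[^f]|f(?!oo))*$} $str]
}

foreach line {"*RESULTS 1" "*SUBSECT 2" "*SOMETHING ELSE ENTIRELY" "*"} {
    puts [format "%-26s charclass=%d lookahead=%d" $line \
        [stops_charclass $line] [stops_lookahead $line]]
}
puts "a fob b: [lacks_foo {a fob b}]"
puts "a foo b: [lacks_foo {a foo b}]"
```

The character-class version also misses a bare "*" line, since it needs
seven more characters after the asterisk; the lookahead version does not.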

> After all this I feel a bit humbled and also like I'm in over my head
> somewhat relying only upon Brent Welch's regexp section in the book as
> my regexp tutorial/reference. I presume this is a good opportunity to
> learn by reading "Mastering Regular Expressions"? How well does that
> book apply to Tcl, knowing that there are some semantic differences
> between, say Perl and Tcl? I know very little to no Perl and I'd
> prefer not to have to pick up yet another language to understand
> what's written in MRE.
>

That is an excellent book for learning REs, and it applies to Tcl, Perl, grep, etc.
There might be some slight differences in exact syntax from one implementation
to another (which shortcut escapes exist, what they mean, etc.), but the main meat is
the same.

Also see the Regular Expression Wiki page at http://mini.net/tcl/396.html
and the RE Debugging Tips page http://mini.net/tcl/1345.html for more info and links

Bruce

lvi...@yahoo.com

Nov 29, 2001, 3:02:57 PM

According to jlblackwell <jlbla...@freeze.com>:
:After all this I feel a bit humbled and also like I'm in over my head
:somewhat relying only upon Brent Welch's regexp section in the book as
:my regexp tutorial/reference. I presume this is a good opportunity to
:learn by reading "Mastering Regular Expressions"? How well does that
:book apply to Tcl, knowing that there are some semantic differences
:between, say Perl and Tcl?

Mastering Regular Expressions will give you a much richer understanding
of how regular expressions work under the covers, as well as how to
use them to solve problems. I seem to recall that Tcl regexps
are in fact covered during the course of the book.

--
"I know of vanishingly few people ... who choose to use ksh." "I'm a minority!"
<URL: mailto:lvi...@cas.org> <URL: http://www.purl.org/NET/lvirden/>
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.