Need help understanding a TCL expression

Dave Yeatman

Sep 8, 2005, 7:56:22 PM

Hi folks,
I am new to the newsgroup, just started learning TCL. I have experience
in various flavors of VB, C++ and Delphi.

In TCL a few of the more cryptic regexp combinations have me scratching
my head.

Here's one from some example code I am trying to understand:

if {![regexp -nocase "${match}(.*?)\r\n\1" $str str data]} {return 0}

I've got the if NOT, -nocase, \r, \n and return 0 stuff Ok and I know
$str is my input string to regexp.

I can't seem to tie the stuff inside the quotes below all together and
make good sense out of it:

"${match}(.*?)\r\n\1" $str str data

Would it be possible that someone could decipher it for me?

Thanks!

dave

Michael A. Cleverly

Sep 8, 2005, 10:14:59 PM

On Thu, 8 Sep 2005, Dave Yeatman wrote:

> Hi folks,
> I am new to the newsgroup, just started learning TCL. I have experience in
> various flavors of VB, C++ and Delphi.

Welcome!

> In TCL a few of the more cryptic regexp combinations have me scratching my
> head.

The re_syntax man page is a good resource. And using "expanded" regular
expression syntax, with inline comments, can help lead to more
understandable (and maintainable) regexp's down the road.

http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/re_syntax.htm
http://wiki.tcl.tk/re_syntax
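
As a rough illustration of the expanded syntax (a sketch only: the
"notice:" prefix and the sample string are made up for the example, not
taken from the code under discussion), the (?x) prefix lets a pattern be
spread over several lines with a comment on each piece:

```tcl
# Expanded syntax: (?x) at the very start makes regexp ignore
# whitespace and #-comments inside the pattern.
set RE {(?x)
    notice:\s      # hypothetical literal prefix, then one whitespace char
    ([^\r]*)       # capture everything up to the carriage return
    \r\n           # the CR LF line terminator
}

# Made-up sample input for illustration.
set sample "Notice: hello\r\nmore text"
if {[regexp -nocase $RE $sample whole data]} {
    puts "captured: $data"
}
```

Because whitespace is ignored in this mode, a literal space has to be
written as \s (or escaped), which is why the prefix ends in \s here.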

> Here's one from some example code I am trying to understand:
>
> if {![regexp -nocase "${match}(.*?)\r\n\1" $str str data]} {return 0}

I highly suspect that this regular expression either does not do what its
original author intended (or there was an error in re-typing it). But see
below for that, after an explanation of the various pieces...

> I've got the if NOT, -nocase, \r, \n and return 0 stuff Ok and I know $str is
> my input string to regexp.
>
> I can't seem to tie the stuff inside the quotes below all together and make
> good sense out of it:
>
> "${match}(.*?)\r\n\1" $str str data
>
> Would it be possible that someone could decipher it for me?

${match} will substitute in the current value of the scalar variable
"match" in the local scope. It is wrapped in braces because otherwise the
following parentheses (which are capturing in regexps) would be
misinterpreted as referencing an element of the array match.
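
A small sketch of the difference, assuming a scalar variable named match
holding a made-up value:

```tcl
# Hypothetical value, chosen only for illustration.
set match "notice: "

# With braces the variable name ends at the closing brace, so the
# parenthesised part stays literal text in the resulting string.
set re1 "${match}(.*?)"
puts $re1

# Without braces Tcl parses $match(.*?) as an element of an array
# named match. Since match is a scalar here, that raises an error.
if {[catch {set re2 "$match(.*?)"} msg]} {
    puts "error: $msg"
}
```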

Suppose the value of match was:

% set match "notice: "

then the second argument that the regexp command receives (the first
being the -nocase switch, which tells regexp to make case-insensitive
matches) would be the string

% set RE "notice: (.*?)\r\n\1"
notice: (.*?)\r\n\1

Well, actually that isn't technically true, because the \r, \n and \1
substitutions would have already occurred. But it's hard to visualize that
white space in a Usenet post... you get the idea, I suspect.
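
One way to see that substitution without printing the invisible
characters is to compare string lengths (a minimal sketch; the variable
names are just for the example):

```tcl
# In double quotes Tcl performs the backslash substitutions itself,
# so each escape collapses to a single character.
set quoted "\r\n\1"
puts [string length $quoted]   ;# 3

# In braces nothing is substituted: all six characters survive and
# regexp would be the one to interpret them.
set braced {\r\n\1}
puts [string length $braced]   ;# 6
```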

The "(.*?)" portion of the regular expression is a capturing set of
parentheses that looks for a non-greedy match of 0 or more of any
character.
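
A quick sketch of greedy versus non-greedy matching on a made-up string
(the sample text and variable names are assumptions for illustration):

```tcl
set s "a1b2b3"

# Greedy: .* grabs as much as it can while still letting "b" match.
regexp {a(.*)b} $s -> greedy
puts $greedy   ;# 1b2

# Non-greedy: .*? stops at the first place where "b" can match.
regexp {a(.*?)b} $s -> lazy
puts $lazy     ;# 1
```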

It would have been a greedy match if it was written as "(.*)" instead.
You already know about the \r and \n, but the \1 is where I think the
original author made a mistake.

In re_syntax you can refer to a previous set of capturing parentheses with
a backslash followed by a number. So in this case I suspect they wanted
to make sure whatever was captured in parentheses was repeated at the
start of the next line.

HOWEVER, because they used "quotes" instead of {braces} (because they
needed to reference the variable $match) the Tcl interpreter will also do
\ substitution. Meaning \r and \n become the actual carriage return &
line feed characters, and \1 (an octal escape) becomes the single
character with code 1, not a back reference. If a back reference was
intended, change \1 to \\1 so that \\ becomes \ and the regexp command
will see the "\1" and use it to match the same contents of what was
captured in the parentheses.
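
To make the quoting difference concrete, here is a small sketch using a
made-up pattern that looks for a repeated word:

```tcl
# "\\1" in quotes (like {\1} in braces) hands regexp a backslash
# followed by the digit 1, which it treats as a back reference.
set re "(\\w+) \\1"
set same [regexp $re "hello hello"]
set diff [regexp $re "hello world"]
puts $same   ;# 1: the captured word is repeated
puts $diff   ;# 0: no repeated word

# "\1" in quotes is replaced by Tcl before regexp ever sees it,
# leaving a single character whose code is 1.
set code [scan "\1" %c]
puts $code   ;# 1
```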

(As an aside, if you ever want to group, but not capture, use (?:foo)
instead of (foo)...)
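
For instance (a sketch with a made-up pattern and input), only the second
group below fills a match variable:

```tcl
# (?:...) groups the alternation without capturing it, so the first
# capture variable after the whole match receives the digits.
regexp {(?:foo|bar)-(\d+)} "bar-42" whole num
puts $whole   ;# bar-42
puts $num     ;# 42
```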

So this regular expression pattern expects text that matches whatever
pattern is in the $match variable, followed by as little as possible,
until it can then get a carriage return and line feed followed by the
character with code 1.

Incidentally, the (.*?) was probably made non-greedy so it would match as
little output as possible, presumably not crossing the \r\n boundary.
Except that even though it is non-greedy, it still takes as much as it
needs to in order to find a successful match.
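
A sketch of that effect, using a made-up input string and a literal
"notice: " prefix in place of $match:

```tcl
# Hypothetical input with two CR LF pairs before the final character.
set str "notice: first\r\nsecond\r\n\1"

# Non-greedy (.*?) still runs past the first CR LF, because only the
# second one is followed by the character with code 1.
regexp "notice: (.*?)\r\n\1" $str whole data
puts [string length $data]   ;# 13, i.e. "first", CR, LF, "second"

# A negated class cannot cross a carriage return, so the same input
# does not match at all. The class is appended in braces so that Tcl
# does not treat the brackets as command substitution.
set RE "notice: "
append RE {([^\r]*)} "\r\n\1"
set strict [regexp $RE $str]
puts $strict                 ;# 0
```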

Perhaps a better regexp (if I'm inferring the purpose correctly) would be:

set RE $match
append RE {([^\r]*)\r\n\1}

regexp -nocase $RE $str str data

If $match contains greedy quantifiers, how the greedy and non-greedy
parts combine may not be immediately apparent. When you're ready to think
about that, this thread from 2001 may shed some light on things:

http://groups.google.com/group/comp.lang.tcl/browse_thread/thread/356ab281af942dda

Depending on where $match comes from it could contain data that would make
an invalid regexp. (For instance if it contained the string "1)", you'd
get an unbalanced set of parentheses error from regexp.) If that could be
a problem you'd want to "escape" any special characters in match first.
For that I'd recommend using the [string map] command.
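
One possible way to do that escaping with [string map] is sketched
below. The helper name reQuote and the exact list of characters are
assumptions made for this example, not something from the original code:

```tcl
# Hypothetical helper: put a backslash in front of every character
# that has a special meaning in a regular expression.
proc reQuote {s} {
    set map {}
    foreach ch {\\ . * + ? ( ) [ ] \{ \} ^ $ |} {
        lappend map $ch \\$ch
    }
    return [string map $map $s]
}

set safe [reQuote "1)"]
puts $safe                      ;# 1\)
puts [regexp $safe "item 1) x"] ;# 1

# Unescaped, the same text is rejected as a pattern.
if {[catch {regexp "1)" "item 1) x"} msg]} {
    puts "error: $msg"
}
```

The escaped value can then be placed in front of the rest of the
pattern just like $match is in the original command.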

Completing our analysis of the command: the third argument, $str, is the
string the regexp is being tested against. The fourth argument, str, is
the name of a variable that will contain the whole match (if there is
one). Then data, the fifth argument to this regexp command, will be set to
whatever the first set of parentheses matched.
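
Put together (a sketch with a made-up input and a literal "notice: "
prefix standing in for $match), the variables end up like this:

```tcl
# Hypothetical input with extra text before and after the match.
set str "xx notice: hello\r\n\1 yy"

set matched [regexp -nocase "notice: (.*?)\r\n\1" $str whole data]
puts $matched   ;# 1
puts $data      ;# hello

# Reusing the name str for the whole-match variable, as the original
# command does, overwrites the input with just the matched part.
regexp -nocase "notice: (.*?)\r\n\1" $str str data
puts [string equal $str $whole]   ;# 1
```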

Michael

Schelte Bron

Sep 10, 2005, 4:14:40 AM

Michael A. Cleverly wrote:

[...snip...]

>> if {![regexp -nocase "${match}(.*?)\r\n\1" $str str data]}
>> {return 0}
>
> I highly suspect that this regular expression either does not do
> what its
> original author intended (or there was an error in re-typing it).

[...snip...]

Michael did an excellent job at explaining most of the command.
Since I'm the root cause of this question, or the original author
as he put it, I may be able to fill in the gaps.

The contents of the $str variable come from a device that
terminates its transmissions with the string "\r\n\1", or in other
words the byte sequence (in hex) 0d 0a 01. For that reason the
string needs to be substituted before being passed to regexp
because, as Michael indicated, the \1 has a completely different
meaning there.
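
A sketch of that situation, assuming a made-up transmission (the payload
12345 is invented for the example; only the prefix and the terminator
come from the description above):

```tcl
set match "b5 cid: #="

# Hypothetical transmission from the device.
set str "B5 CID: #=12345\r\n\1"

# The last three characters are the bytes 0d 0a 01.
binary scan [string range $str end-2 end] H* tail
puts $tail   ;# 0d0a01

# Tcl substitutes the escapes in the quoted pattern, so regexp sees
# the same three literal characters at the end of the pattern.
if {[regexp -nocase "${match}(.*?)\r\n\1" $str whole data]} {
    puts $data   ;# 12345
}
```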

The value of the $match variable is limited to a very small set of
possibilities. In fact in this case it is guaranteed to have the
value "b5 cid: #=". None of those characters have any special
meaning to regexp, so it is safe to use it like this. However it is
very good advice to keep in mind that in certain circumstances it
is not wise to use an unknown literal string inside a regular
expression.

Schelte.
--
set Reply-To [string map {nospam schelte} $header(From)]

Dave Yeatman

Sep 10, 2005, 9:44:07 AM

Thanks Mike for the excellent explanation!

I had already looked at the resources you mentioned below and still had
a bit of a time getting my mind around this one.

I hope one day I'll be at a point where I can answer questions about TCL
as well....

dave

Dave Yeatman

Sep 10, 2005, 9:50:23 AM

Hi Schelte!

Thank you for clearing up that last bit of confusion. I have the new
plugin we discussed at least working but I have a bit of debugging to do
yet...

dave

yahalom

Sep 12, 2005, 1:15:48 AM

I recommend using Visual REGEXP. It helps with debugging and understanding
regular expressions. You can get it at
http://laurent.riesterer.free.fr/regexp/
It is written in Tcl, so no install is needed if you have Tcl.

Jonathan Bromley

Sep 19, 2005, 9:11:48 AM

It has a MUCH better designed GUI than my TREV
(Tcl Regular Expression Viewer); but TREV
parses the regexp and lets you see which pieces of it
matched each piece of source text; I think that's nice
for learning.

http://www.doulos.com/knowhow/tcltk/examples/trev/

The very best visual regexp machine I've ever seen is
the Regex Coach (Google for it, I've lost the link)
but that does Perl-flavoured regexps.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL, Verilog, SystemC, Perl, Tcl/Tk, Verification, Project Services

Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, BH24 1AW, UK
Tel: +44 (0)1425 471223 mail:jonathan...@doulos.com
Fax: +44 (0)1425 471573 Web: http://www.doulos.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.