surprising behavior of [regsub -all ..] at word boundaries \y

Bernhard Kick

May 15, 2002, 7:22:30 AM

Hello,

i find the following behavior of regsub surprising, and would be glad
if someone could explain this...

case a)
% regsub -all {\yabc} abczabc X new
1
% set new
Xzabc

This is what I would expect.

However, case b)
% regsub -all {\yabc} abcabc X new
2
% set new
XX

This surprises me.

I had expected case b) to return 1, and new="Xabc",
because the second "abc" in "abcabc" does not start at a word boundary,
so it should not be replaced.

(I get the same results from tcl 8.2 on AIX and tcl 8.3 on Win2K)

Is this a bug, or what is wrong with my argument?

Regards,
Bernhard Kick

Glenn Jackman

May 15, 2002, 7:56:28 AM

On 15 May 2002 04:22:30 -0700, Bernhard Kick <bernha...@gmx.de> wrote:
[...]

> % regsub -all {\yabc} abcabc X new
> 2
> % set new
> XX
>
> This surprises me.
>
> I had expected case b) to return 1, and new="Xabc",
> because the second "abc" in "abcabc" does not start at a word boundary,
> so it should not be replaced.

Reading http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm

\y matches only at the beginning *or end* of a word

whereas

\m matches only at the beginning of a word
\M matches only at the end of a word

However
% regsub -all {\mabc} abcabc X new

2
% set new
XX

Seems odd.

--
Glenn Jackman
gle...@ncf.ca

Michael A. Cleverly

May 15, 2002, 11:06:21 PM

On 15 May 2002, Glenn Jackman wrote:

> However
> % regsub -all {\mabc} abcabc X new
> 2
> % set new
> XX
>
> Seems odd.

It does--but then most things dealing with regular expressions seem that
way at first (at least they do to me :-).

The way I make sense of the behavior is from a very literal reading of the
-all switch on the regsub man page.

"All ranges in string that match exp are found
and substitution is performed for each of these
ranges. Without this switch only the first
matching range is found and substituted."

So, for the first match (pass), the pattern {\mabc} matches:

abcabc <= string
^abc <= regular expression (^ == start of word)

substitution is then returned, and "new" is set to X.

abc <= remaining string
^abc <= regular expression

another immediate match, so substitution is performed again, and "new" is
appended with another X.

If the input had been "abczabc" instead of "abcabc", things are different:

abczabc <= string
^abc <= regular expression

Match, so new == X

zabc <= remaining string
^abc <= regular expression

No match, so string is appended unchanged to new: Xzabc

The pattern {abc\M} doesn't have the same effect because the first abc is
"consumed" (as a non-match) before finding the second abc that then
matches on the end of word. (So "new" would be "abcX" instead.)
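
The pass-by-pass reading above can be sketched as a small Tcl procedure. This is only an illustrative model of the described behavior, not the actual C implementation of [regsub -all]. The name substring_regsub_all is made up for this sketch, lassign needs Tcl 8.5 or later, and the pattern is assumed never to match the empty string.

```tcl
# Illustrative model: replace every match by repeatedly matching against
# the *remaining* string, so anchors such as \m and \y are evaluated as
# if the remainder were a brand-new string.
proc substring_regsub_all {re str repl} {
    set out ""
    # regexp is called without -all and without -start, so each pass
    # sees only what is left of the input.
    while {[regexp -indices -- $re $str range]} {
        lassign $range first last
        # Copy the unmatched prefix, then the literal replacement.
        append out [string range $str 0 [expr {$first - 1}]] $repl
        # Drop everything up to and including the match.
        set str [string range $str [expr {$last + 1}] end]
    }
    # Whatever is left did not match; append it unchanged.
    append out $str
    return $out
}

puts [substring_regsub_all {\mabc} abcabc  X]   ;# XX
puts [substring_regsub_all {\mabc} abczabc X]   ;# Xzabc
puts [substring_regsub_all {abc\M} abcabc  X]   ;# abcX
```

Because the remainder is handed over as a fresh string, its first character always sits at a "start of word" position if it is a word character, which is the effect discussed in this thread.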

Michael

Jeffrey Hobbs

May 16, 2002, 1:01:10 AM

Just to reaffirm in whole, I have gone through this logic more carefully
at the C level before and Michael's explanation is correct and accurate.

--
Jeff Hobbs The Tcl Guy
Senior Developer http://www.ActiveState.com/
Tcl Support and Productivity Solutions

Bernhard Kick

May 16, 2002, 9:37:57 AM

Jeffrey Hobbs <Je...@ActiveState.com> wrote in message news:<3CE33E54...@ActiveState.com>...

> Just to reaffirm in whole, I have gone through this logic more carefully
> at the C level before and Michael's explanation is correct and accurate.
>

Thanks for the explanation.
From an implementer's point of view it seems obvious(?)

Let me argue from a user's point of view:

From Michael's explanation follows, that a regsub word boundary
is not only between a nonword char and a word char (as expected),
but also at the start of the "remaining string" (after the RE match).

This means that a word boundary depends on the RE passed to regsub, and you
cannot tell where the word boundaries are by looking at the string only.

For example, in

regsub -all {\yabc} abcabc X new

the boundaries are |abc|abc| (| indicates word boundary)
but in
regsub -all {\ya} abcabc X new
the boundaries are |a|bca|bc|
and in
regsub -all {\y[a-c]} abcabc ....
the boundaries are |a|b|c|a|b|c|

IMHO, this is an unusual definition of word boundary,
and, worse, it differs from TCL's regexp definition.
If one asks regexp, it answers (correctly, i think) that there
is _no_ word boundary in the middle of abcabc:
% regexp -start 1 {\yabc} "abcabc"
0

IMHO, word boundaries (\y and \m, \M) in regsub should work the same
as in regexp (even if it makes the implementation harder).

BTW,
the start-of-string operator "^" does work as expected in regsub
(different from \y):

% regsub -all {^abc} abcabc X new
1
% set new
Xabc

Had i applied Michael's explanation to this example,
i would have predicted that it should return 2, and new=XX.
(extending Michael's explanation to "^" is probably not fair, sorry,
i do it anyway to make it obvious that regsub should be fixed...)

Thank you for listening,
Bernhard Kick

Donal K. Fellows

May 16, 2002, 9:32:20 AM

Jeffrey Hobbs wrote:
> Just to reaffirm in whole, I have gone through this logic more carefully
> at the C level before and Michael's explanation is correct and accurate.

Ah, but is it desirable? If not, is there an easy way to stop it from
happening? I fear the answer is "no" to both of those questions, because
[regsub -all] works by running substrings through the regexp engine instead of
asking the engine for all the places where it found a match, and I don't think
there's a way to ask for no matches before a certain point. Even though I
suppose that'd be possible in principle...

Donal (I am *NOT* about to start maintaining the regexp engine!)
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk
-- "I'm going to open a new xterm. This one's pissing me off" Anon. (overheard)

Michael A. Cleverly

May 16, 2002, 11:17:06 AM

On 16 May 2002, Bernhard Kick wrote:

> From Michael's explanation follows, that a regsub word boundary
> is not only between a nonword char and a word char (as expected),
> but also at the start of the "remaining string" (after the RE match).
>
> This means that a word boundary depends on the RE passed to regsub, and you
> cannot tell where the word boundaries are by looking at the string only.
>
> For example, in
> regsub -all {\yabc} abcabc X new
> the boundaries are |abc|abc| (| indicates word boundary)
> but in
> regsub -all {\ya} abcabc X new
> the boundaries are |a|bca|bc|

Actually here, given the -all switch, the regular expression, and the
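input string, the boundary is actually:

|a|bcabc|
^ ^     ^
| |     |--- end of the original string
| |--------- start of the "remaining" string after 1st match (w/ -all)
|----------- beginning of the original string

It isn't |a|bca|bc| because:

|a|bca|bc|
      ^
      |---- isn't the start of the "remaining" string after the
            previous match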

> and in
> regsub -all {\y[a-c]} abcabc ....
> the boundaries are |a|b|c|a|b|c|

Yes, this is true. Another curious way of looking at this is that the
following all yield the same end result:

regsub -all {\y.} $string X new
regsub -all {\y\w} $string X new
regsub -all {\w} $string X new

(at least, as far as I've been able to think through... :^)

> IMHO, this is an unusual definition of word boundary,
> and, worse, it differs from TCL's regexp definition.
> If one asks regexp, it answers (correctly, i think) that there
> is _no_ word boundary in the middle of abcabc:
> % regexp -start 1 {\yabc} "abcabc"
> 0

... and regsub, even with -all acts the same way:

% regsub -start 1 -all {\yabc} abcabc X new
0
% set new
abcabc

> IMHO, word boundaries (\y and \m, \M) in regsub should work the same
> as in regexp (even if it makes the implementation harder).

It isn't that the behavior of word boundaries (\y, \m, and \M) is
"different"; the behavior exhibited is a direct result of the use of the
-all switch. For example, regexp exhibits the same behavior when you use
the -all switch with it:

% regexp -inline -all {\yabc} abcabc
abc abc
% regexp -inline -all {\yabc} abczabc
abc

[regexp] and [regsub] both do the same thing, as do [regexp -all] and
[regsub -all]. If anything, the documentation on the -all switch could
be made more explicit about this edge case, but I don't (personally)
think anything needs fixing in the regexp/regsub engine.

You can use a modified RE and substitution pattern to yield the original
expected type of result:

% set RE {\mabc(\w*)\M}
\mabc(\w*)\M
% set string "abc abczabc abcabc zabcabc abcabc"
abc abczabc abcabc zabcabc abcabc
% set replacement {X\1}
X\1
% regsub -all $RE $string $replacement new
4
% set new
X Xzabc Xabc zabcabc Xabc

> BTW,
> the start-of-string operator "^" does work as expected in regsub
> (different from \y):
>
> % regsub -all {^abc} abcabc X new
> 1
> % set new
> Xabc
>
> Had i applied Michael's explanation to this example,
> i would have predicted that it should return 2, and new=XX.
> (extending Michael's explanation to "^" is probably not fair, sorry,
> i do it anyway to make it obvious that regsub should be fixed...)

Except that ^ matches at the start of the original input string, not the
start of the remaining string, so there's a difference. ;-)

> Thank you for listening,
> Bernhard Kick

Fun stuff, these regular expressions... :^)

Michael

Jeffrey Hobbs

May 16, 2002, 11:26:35 AM

Bernhard Kick wrote:
...

> From an implementer's point of view it seems obvious(?)
>
> Let me argue from a user's point of view:
>
> From Michael's explanation follows, that a regsub word boundary
> is not only between a nonword char and a word char (as expected),
> but also at the start of the "remaining string" (after the RE match).
>
> This means that a word boundary depends on the RE passed to regsub, and you
> cannot tell where the word boundaries are by looking at the string only.

The regexp/regsub isn't doing substrings, it is taking an offset
parameter to tell it where to "start". This does affect what it
also thinks of as the "beginning" of the string. It may be possible
to modify this, but I am not going to hazard the patch. Along
similar lines you see different behavior for your other example:

> BTW,
> the start-of-string operator "^" does work as expected in regsub
> (different from \y):

As you expect, it does:

(hobbs) 51 % regexp -all -inline {^a} aaa
a

but look what happens when you use \A:

(hobbs) 52 % regexp -all -inline {\Aa} aaa
a a a

I actually did register this as a bug, although I noted the same
explanation (this is when I figured out this "quirk" myself).

Donal K. Fellows

May 16, 2002, 11:27:25 AM

Bernhard Kick wrote:
> IMHO, this is an unusual definition of word boundary,
> and, worse, it differs from TCL's regexp definition.

The problem is that the system is passing substrings to the regexp engine
instead of telling the engine to look for places to match after a particular
point.

> If one asks regexp, it answers (correctly, i think) that there
> is _no_ word boundary in the middle of abcabc:
> % regexp -start 1 {\yabc} "abcabc"
> 0

What about:
% regexp -start 3 {\yabc} abcabc
1

Donal.

Darren New

May 16, 2002, 12:37:52 PM

"Donal K. Fellows" wrote:
> The problem is that the system is passing substrings to the regexp engine
> instead of telling the engine to look for places to match after a particular
> point.

Another thing to consider is this case:

% regsub -all aba abababa X result
2
% puts $result
XbX

Note that it did not try to substitute X for the "aba" in the middle of
the string, and indeed I can't offhand think of any appropriate result
other than that. Hence, the fact that the substitution is interleaved
with the scanning is kind of necessary due to this behavior.

So when asking whether \y matches what comes before in the word, should
you expect a difference between

regsub -all {\ya} abcabc X new

and
regsub -all {\ya} abcabc " " new
?

In other words, should you look for word boundaries before or after you
substitute in the results?

I don't know the answer, but I thought I should bring it up. :-)

--
Darren New
San Diego, CA, USA (PST). Cryptokeys on demand.
** http://home.san.rr.com/dnew/DNResume.html **
** http://images.fbrtech.com/dnew/ **

My brain needs a "back" button so I can
remember where I left my coffee mug.

Bernhard Kick

May 17, 2002, 2:56:21 AM

Jeffrey Hobbs <Je...@ActiveState.com> wrote in message news:<3CE3D0EA...@ActiveState.com>...

>
> > BTW,
> > the start-of-string operator "^" does work as expected in regsub
> > (different from \y):
>
> As you expect, it does:
>
> (hobbs) 51 % regexp -all -inline {^a} aaa
> a
>
> but look what happens when you use \A:
>
> (hobbs) 52 % regexp -all -inline {\Aa} aaa
> a a a
>
> I actually did register this as a bug, although I noted the same
> explanation (this is when I figured out this "quirk" myself).

Thanks for this nice example.
Agreed, I would also call it a quirk (if not a bug).
Similarly:
% regexp -inline -all {\ya} "aaa"
a a a

I have yet to meet someone who would think the string "aaa" contains
3 words...

My personal summary: word boundaries do not work in TCL's regexp/regsub -all.

Bernhard Kick

May 17, 2002, 4:41:33 AM

Darren New <dn...@san.rr.com> wrote in message news:<3CE3E076...@san.rr.com>...

>
> Another thing to consider is this case:
>
> % regsub -all aba abababa X result
> 2
> % puts $result
> XbX
>
> Note that it did not try to substitute X for the "aba" in the middle of
> the string, and indeed I can't offhand think of any appropriate result
> other than that. Hence, the fact that the substitution is interleaved
> with the scanning is kind of necessary due to this behavior.
>

This looks right and desirable to me.
regexp/regsub work left-to-right and they do not report overlapping
matches.

> So when asking whether \y matches what comes before in the word, should
> you expect a difference between
>
> regsub -all {\ya} abcabc X new
> and
> regsub -all {\ya} abcabc " " new
> ?
>
> In other words, should you look for word boundaries before or after you
> substitute in the results?
>
> I don't know the answer, but I thought I should bring it up. :-)

In my mind the answer is obvious here:
regsub scans the original string (abcabc), and copies the
(substituted) result to new.
You look for word boundaries in the original string, which never
changes.
It does not matter what you substitute, because this will only show up
in the new string, which is built, but not scanned.

Bernhard Kick

May 17, 2002, 5:49:05 AM

"Michael A. Cleverly" <mic...@cleverly.com> wrote in message news:<Pine.LNX.4.33.020516...@gibraltar.cleverly.com>...

>
> Actually here, given the -all switch, the regular expression, and the
> input string, the boundary is actually:
>
> |a|bcabc|
> ^ ^     ^
> | |     |--- end of the original string
> | |--------- start of the "remaining" string after 1st match (w/ -all)
> |----------- beginning of the original string
>
> It isn't |a|bca|bc| because:
>
> |a|bca|bc|
>       ^
>       |---- isn't the start of the "remaining" string after the
>             previous match

you are right. Thanks for correcting this.

> ....

>
> % regsub -start 1 -all {\yabc} abcabc X new
> 0
> % set new
> abcabc

no surprise here. But consider this:
% regsub -start 3 -all {\yabc} abcabc X new
1
% set new
abcX

This is to be expected given TCL's implementation of [regsub -all]
that you described so well,
but it does not match what i consider desirable behavior.

IMHO, there is no word boundary in the middle of "abcabc".
No matter whether I start scanning at position 0, 1, or 3.
Jeff Hobbs came up with this nice example, that points out what
the problem is:
% regexp -all -inline {\ya} aaa
a a a

TCL thinks "aaa" has three word boundaries followed by a.
I don't agree, sorry.
To me "aaa" is one word, and only the first a follows a word boundary.

Some variations on the theme are:
% regexp -all -inline {\ya} aaaa
a a a a
% regexp -all -inline {\ya} zaaaa
(empty)
% regexp -all -inline {\ya} /aaaa
a a a
% regexp -all -inline {\yaa} aaaa
aa aa
% regexp -all -inline {\yaa} zaaaa
(empty)
% regexp -all -inline {\yaa} /aaaa
aa aa

All these results clearly follow from TCL's implementation of regexp -all.

TCL's word boundaries are not a property of the string, but a consequence
of the way [regexp/regsub -all] happens to be implemented.

Problem is that this implementation breaks the nice and clean
definition of word boundary.
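
To make "a property of the string" concrete, here is a minimal sketch that lists boundary positions by looking only at neighbouring characters, with no regular expression involved. The proc name word_boundaries is hypothetical, and [string is wordchar] is used as an approximation of what the regexp engine treats as a word character.

```tcl
# Return the positions in str (0 .. length) where exactly one of the two
# neighbouring characters is a word character. Position i lies between
# character i-1 and character i.
proc word_boundaries {str} {
    set result {}
    set n [string length $str]
    for {set i 0} {$i <= $n} {incr i} {
        # Is there a word character immediately before position i?
        set before [expr {$i > 0 && [string is wordchar -strict [string index $str [expr {$i - 1}]]]}]
        # Is there a word character immediately at position i?
        set after [expr {$i < $n && [string is wordchar -strict [string index $str $i]]}]
        if {$before != $after} {
            lappend result $i
        }
    }
    return $result
}

puts [word_boundaries abcabc]      ;# 0 6
puts [word_boundaries aaa]         ;# 0 3
puts [word_boundaries /aaaa]       ;# 1 5
puts [word_boundaries "abc abc"]   ;# 0 3 4 7
```

Under this definition "abcabc" and "aaa" each have exactly two boundaries, at the start and at the end, whatever pattern is later applied to them.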

>
> Fun stuff, these regular expressions... :^)
>
> Michael

Yes, indeed...

Bernhard Kick

Vince Darley

May 17, 2002, 11:51:50 AM

"Donal K. Fellows" <fell...@cs.man.ac.uk> wrote in message news:<3CE3CFDD...@cs.man.ac.uk>...

> Bernhard Kick wrote:
> > IMHO, this is an unusual definition of word boundary,
> > and, worse, it differs from TCL's regexp definition.
>
> The problem is that the system is passing substrings to the regexp engine
> instead of telling the engine to look for places to match after a particular
> point.

Can't it pass a substring /and/ a flag which says what the current
word/string status is? (i.e. "normal", "start of string", "start of
word", etc). Presumably it actually has that information available
when the recursive call is made...

Vince.

Donal K. Fellows

May 17, 2002, 11:49:03 AM

Darren New wrote:
> So when asking whether \y matches what comes before in the word, should
> you expect a difference between
>
> regsub -all {\ya} abcabc X new
> and
> regsub -all {\ya} abcabc " " new
> ?

Apart from the difference between " " and X? :^) No, the substitutions should
happen in the same places.

> In other words, should you look for word boundaries before or after you
> substitute in the results?

Before, but you must not match anywhere that overlaps with a previously matched
area; it should be as if you multiplied the matcher automaton with one that
doesn't match anywhere up to the end of the previously matched area and matches
everywhere after that.
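
A rough script-level sketch of that rule, limited to the "start of word" case: the pattern is passed without its \m or \y anchor, and the boundary is checked by hand against the original string. The proc name regsub_all_at_word_start is hypothetical, lassign needs Tcl 8.5 or later, and the pattern is assumed to contain no anchors of its own and never to match the empty string.

```tcl
# Replace every non-overlapping match of body that begins at a true start
# of word, where "start of word" is judged against the original string.
proc regsub_all_at_word_start {body str repl} {
    set out ""
    set copied 0    ;# index of the first character not yet copied to out
    set pos 0       ;# index where the next search begins
    # With -indices, the reported indices are relative to the whole string.
    while {[regexp -indices -start $pos -- $body $str range]} {
        lassign $range first last
        set prev [string index $str [expr {$first - 1}]]
        if {$first > 0 && [string is wordchar -strict $prev]} {
            # Preceded by a word character: not a word start, so retry
            # one character further on without replacing anything.
            set pos [expr {$first + 1}]
            continue
        }
        # Copy the text skipped since the last replacement, then the
        # literal replacement.
        append out [string range $str $copied [expr {$first - 1}]] $repl
        # Never match inside the area that has just been replaced.
        set copied [expr {$last + 1}]
        set pos $copied
    }
    append out [string range $str $copied end]
    return $out
}

puts [regsub_all_at_word_start abc abcabc X]          ;# Xabc
puts [regsub_all_at_word_start a   abcabc X]          ;# Xbcabc
puts [regsub_all_at_word_start abc "abc abczabc" X]   ;# X Xzabc
```

The replacement text never influences where matches are found, since only the original string is scanned.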

Donal.
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk

-- This may scare your cat into premature baldness, but Sun are not the only
sellers of Unix. -- Anthony Ord <n...@rollingthunder.clara.co.uk>