Argument handling of [regexp]


Erik Leunissen

Sep 16, 2022, 3:06:49 PM
Here are the results of six invocations of the "regexp" command.

For one of them I'm sure that the result is correct (1).

For some of them I'm unsure (2, 3). I wouldn't be surprised if the result can be explained to be
correct though.

However, for invocations 4, 5 and 6 I definitely can't imagine how the results can be correct:

% set str "a-z"
a-z
% regexp - $str; #1
bad option "-": must be -all, -about, -indices, -inline, -expanded, -line, -linestop, -lineanchor,
-nocase, -start, or --
% regexp \- $str; #2
1
% regexp {-} $str; #3
bad option "-": must be -all, -about, -indices, -inline, -expanded, -line, -linestop, -lineanchor,
-nocase, -start, or --
% regexp -\\ $str; #4
couldn't compile regular expression pattern: invalid escape \ sequence
% regexp \\- $str; #5
1
% set initstring "-"
-
% regexp $initstring $str; #6
1
%

(Note that I'm aware of the purpose of "--" in invocations of "regexp" and of several other
commands. However, that is not my point. My point is the correctness of argument handling in the
above examples).

I'd be grateful for any explanations and judgement.

Erik.
--
elns@ nl | Merge the left part of these two lines into one,
xs4all. | respecting a character's position in a line.

Erik Leunissen

Sep 16, 2022, 3:19:43 PM
On 16/09/2022 21:06, Erik Leunissen wrote:
> Here are the results of six invocations of the "regexp" command.
>
> For one of them I'm sure that the result is correct (1).
>
> For some of them I'm unsure (2, 3). I wouldn't be surprised if the result can be explained to be
> correct though.
>
> However, for invocations 4, 5 and 6 I definitely can't imagine how the results can be correct:
>

After more thinking, I can imagine that 5 and 6 can be explained also.

But I can't wrap my mind around cases 4 and 3 (the latter additionally to my previous post).

briang

Sep 16, 2022, 8:54:52 PM
The arguments typed in the source code are not precisely what the command actually sees. This is explained by a close reading of the rules of Tcl.
The best way to demonstrate this is with the following example:

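proc myregexp {args} {
    # Print each argument exactly as the command receives it,
    # i.e. after the Tcl parser has done its substitutions.
    puts -nonewline "regexp "
    foreach arg $args {
        puts -nonewline "$arg "
    }
    puts ""
}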

myregexp - $str; #1
myregexp \- $str; #2
myregexp {-} $str; #3
myregexp -\\ $str; #4
myregexp \\- $str; #5
set initstring "-"
myregexp $initstring $str; #6

The results:
regexp - a-z
regexp - a-z
regexp - a-z
regexp -\ a-z
regexp \- a-z
regexp - a-z

-Brian

Erik Leunissen

Sep 17, 2022, 9:39:54 AM
On 17/09/2022 02:54, briang wrote:
>
> The arguments typed in the source code are not precisely what the command actually sees. This is explained by a close reading of the rules of Tcl.

Thanks Brian, I will investigate what you indicate.
Nonetheless, these results ... :

>
> The results:
> regexp - a-z
> regexp - a-z
> regexp - a-z
> regexp -\ a-z
> regexp \- a-z
> regexp - a-z
>

... indicate that the regexp command sees identical arguments for cases 1, 2, 3 and 6.
However, the *results* of the command invocations for these cases are not the same.

Therefore, this non-correspondence still puzzles me, and right now I can't imagine how any rule can
make that correspond. Nevertheless, I will have a look at "the rules of Tcl". Just to be sure: do
you mean the dodekalogue as in:

https://wiki.tcl-lang.org/page/Dodekalogue

Regards,
Erik
--

> -Brian

briang

Sep 17, 2022, 10:49:15 AM
On Saturday, September 17, 2022 at 6:39:54 AM UTC-7, Erik Leunissen wrote:
> On 17/09/2022 02:54, briang wrote:
> >
> > The arguments typed in the source code are not precisely what the command actually sees. This is explained by a close reading of the rules of Tcl.
> Thanks Brian, I will investigate what you indicate.
> Nonetheless, these results ... :
> >
> > The results:
> > regexp - a-z
> > regexp - a-z
> > regexp - a-z
> > regexp -\ a-z
> > regexp \- a-z
> > regexp - a-z
> >
> ... indicate that the regexp command sees identical arguments for cases 1, 2, 3 and 6.
> However, the *results* of the command invocations for these cases are not the same.

I see what you mean. That is strange.

-Brian

Schelte

Sep 17, 2022, 11:09:10 AM
On 16/09/2022 21:06, Erik Leunissen wrote:
> % regexp - $str; #1
> bad option "-": must be -all, -about, -indices, -inline, -expanded,
> -line, -linestop, -lineanchor, -nocase, -start, or --
> % regexp \- $str; #2
> 1
While these should be the exact same thing, they produce different byte
codes:

% ::tcl::unsupported::disassemble script {regexp - $str}
ByteCode 0x0x555be8c65680, refCt 1, epoch 17, interp 0x0x555be8bfa380
(epoch 17)
Source "regexp - $str"
Cmds 1, src 13, inst 10, litObjs 3, aux 0, stkDepth 3, code/src 0.00
Commands 1:
1: pc 0-8, src 0-12
Command 1: "regexp - $str"
(0) push1 0 # "regexp"
(2) push1 1 # "-"
(4) push1 2 # "str"
(6) loadStk
(7) invokeStk1 3
(9) done

% ::tcl::unsupported::disassemble script {regexp \- $str}
ByteCode 0x0x555be8c66180, refCt 1, epoch 17, interp 0x0x555be8bfa380
(epoch 17)
Source "regexp \- $str"
Cmds 1, src 14, inst 8, litObjs 2, aux 0, stkDepth 2, code/src 0.00
Commands 1:
1: pc 0-6, src 0-13
Command 1: "regexp \- $str"
(0) push1 0 # "-"
(2) push1 1 # "str"
(4) loadStk
(5) regexp +3
(7) done

That seems to me like there's a bug lurking somewhere.
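
To compare such listings without reading them line by line, here is a minimal sketch. The helper name uses_generic_invoke is made up for illustration, and the check assumes that a fallback to the ordinary command shows up as an "invokeStk" instruction, as it does in the listings above. Results depend on the Tcl version, so no expected output is shown.

```tcl
# Hypothetical helper (not part of Tcl): report whether the compiled form
# of a script contains a generic command invocation ("invokeStk...")
# instead of an inlined instruction.
proc uses_generic_invoke {script} {
    # The script is only compiled for the listing, never executed,
    # so $str does not have to exist here.
    set listing [::tcl::unsupported::disassemble script $script]
    return [string match *invokeStk* $listing]
}

foreach script [list {regexp - $str} {regexp \- $str} {regexp -- - $str}] {
    puts [format "%-20s generic invoke: %d" $script [uses_generic_invoke $script]]
}
```

Note that ::tcl::unsupported::disassemble is, as its namespace says, unsupported; its output format may change between releases.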


Schelte.


heinrichmartin

Sep 18, 2022, 6:07:36 AM
This message evolved in a non-linear way while thinking/trying. I hope I cleaned up enough to prevent confusion ...

On Saturday, September 17, 2022 at 5:09:10 PM UTC+2, Schelte wrote:
> That seems to me like there's a bug lurking somewhere.

Just guessing: the byte-code compiler treats regexp specially, and in this case it gets switch handling wrong, i.e. it does not obey the Dodekalogue for switches.

Not just guessing: if we override regexp, the issue is gone.

set str foo
eval {regexp \- $str} ;# 0
rename regexp tcl_regexp
proc regexp args {tailcall tcl_regexp {*}$args}
eval {regexp \- $str} ;# bad option "-"

Using -- fixes the issue, too.

eval {regexp -- \- $str} ;# 0
eval {regexp -- - $str} ;# 0
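
When the pattern comes from a variable or from user input, the same idea can be wrapped up once. A minimal sketch, assuming a hypothetical helper name safe_regexp (not from this thread) and only the two-argument form of regexp:

```tcl
# Hypothetical helper: always end switch parsing with "--" so that a
# pattern starting with a dash is never mistaken for an option.
proc safe_regexp {pattern string} {
    return [regexp -- $pattern $string]
}

set str "a-z"
puts [safe_regexp "-" $str]         ;# pattern is a lone dash
puts [safe_regexp "-all" "x-ally"]  ;# pattern looks like a switch name
```

With "--" in place the result no longer depends on how the pattern was spelled in the source or on whether the call was byte-compiled.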

Back to the original command. Still with -- in place, the byte-code (obviously?) differs from Schelte's ...

% set tcl_patchLevel
8.6.4
% ::tcl::unsupported::disassemble script {regexp -- \- $str}
ByteCode 0x0x23c31d0, refCt 1, epoch 15, interp 0x0x22fc680 (epoch 15)
Source "regexp -- \- $str"
Cmds 1, src 17, inst 8, litObjs 2, aux 0, stkDepth 2, code/src 0.00
Commands 1:
1: pc 0-6, src 0-16
Command 1: "regexp -- \- $str"
(0) push1 0 # "-"
(2) push1 1 # "str"
(4) loadStk
(5) regexp +3
(7) done

% ::tcl::unsupported::disassemble script {regexp -- - $str}
ByteCode 0x0x23c33d0, refCt 1, epoch 15, interp 0x0x22fc680 (epoch 15)
Source "regexp -- - $str"
Cmds 1, src 16, inst 8, litObjs 2, aux 0, stkDepth 2, code/src 0.00
Commands 1:
1: pc 0-6, src 0-15
Command 1: "regexp -- - $str"
(0) push1 0 # "*-*"
(2) push1 1 # "str"
(4) loadStk
(5) strmatch +0
(7) done

Now, this is really interesting: Tcl optimizes regexp by replacing it with string match for trivial pattern "-".
And now, I can also understand the byte-code better.
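
A quick way to convince yourself that the substitution is harmless for this pattern: a minimal sketch comparing both forms on a few sample strings (the sample strings and the variable names are mine, chosen for illustration).

```tcl
# For the literal pattern "-", an unanchored regexp match and the glob
# pattern "*-*" should agree on every input string.
set results {}
foreach s {a-z foo - ""} {
    set viaRegexp [regexp -- - $s]
    set viaGlob   [string match *-* $s]
    lappend results $viaRegexp $viaGlob
    puts "'$s': regexp=$viaRegexp string-match=$viaGlob"
}
```

Both columns agree for each string, which is what makes the replacement safe for a pattern without regexp metacharacters.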

> On 16/09/2022 21:06, Erik Leunissen wrote:
> > % regexp - $str; #1
> > bad option "-": must be -all, -about, -indices, -inline, -expanded,
> > -line, -linestop, -lineanchor, -nocase, -start, or --
> > % regexp \- $str; #2
> > 1
> While these should be the exact same thing, they produce different byte
> codes:
>
> % ::tcl::unsupported::disassemble script {regexp - $str}
> ByteCode 0x0x555be8c65680, refCt 1, epoch 17, interp 0x0x555be8bfa380
> (epoch 17)
> Source "regexp - $str"
> Cmds 1, src 13, inst 10, litObjs 3, aux 0, stkDepth 3, code/src 0.00
> Commands 1:
> 1: pc 0-8, src 0-12
> Command 1: "regexp - $str"
> (0) push1 0 # "regexp"
> (2) push1 1 # "-"
> (4) push1 2 # "str"
> (6) loadStk
> (7) invokeStk1 3
> (9) done

The byte-code compiler cannot optimize regexp with the invalid "switch" "-"; therefore, it simply invokes the actual command (which will bail out).

> % ::tcl::unsupported::disassemble script {regexp \- $str}
> ByteCode 0x0x555be8c66180, refCt 1, epoch 17, interp 0x0x555be8bfa380
> (epoch 17)
> Source "regexp \- $str"
> Cmds 1, src 14, inst 8, litObjs 2, aux 0, stkDepth 2, code/src 0.00
> Commands 1:
> 1: pc 0-6, src 0-13
> Command 1: "regexp \- $str"
> (0) push1 0 # "-"
> (2) push1 1 # "str"
> (4) loadStk
> (5) regexp +3
> (7) done

The byte-code compiler seems to not detect the erroneous first argument. It fails to apply backslash substitution before looking for switches ... therefore, it produces the "correct" invocation of regexp.

Erik Leunissen

Sep 18, 2022, 7:34:59 AM
A bug report, referring to this discussion thread, has been filed at:

https://core.tcl-lang.org/tcl/tktview?name=697b1bbfe3

heinrichmartin

Sep 18, 2022, 10:26:03 AM
"Why are other commands with options not affected? Are they?" came to my mind.
Well, lsearch is not.

lsearch has optional arguments only on one side of the required ones, but regexp has options and optional trailing arguments, which makes interpretation ambiguous and therefore requires --.
But calling *regexp with exactly two arguments is not ambiguous* at all! However, the man page clearly states "If the initial arguments to regexp start with - then they are treated as switches.".

Having said that, my interpretation could have been wrong:
perhaps the byte-code compiler has a shortcut for exactly two arguments, which goes against the documentation.

As stopping here is unsatisfying ... for those who are interested:

/*
 * We are only interested in compiling simple regexp cases. Currently
 * supported compile cases are:
 *   regexp ?-nocase? ?--? staticString $var
 *   regexp ?-nocase? ?--? {^staticString$} $var
 */

And a few lines later, we can confirm that the compiler is _not_ looking at the arguments, if there are only two of them.

for (i = 1; i < parsePtr->numWords - 2; i++) {

Finally, let's cross-check by adding more args:

% set str foo
% eval {regexp \- $str match} ;# bails out
% eval {regexp \- $str} ;# 0

Bottom line: The byte-code compiler fails to implement "If the initial arguments to regexp start with - then they are treated as switches."; it should refuse to compile (i.e. leave it to runtime) if the first of two words starts with a dash.