This message evolved in a non-linear way while I was thinking and trying things out. I hope I cleaned it up enough to prevent confusion ...
On Saturday, September 17, 2022 at 5:09:10 PM UTC+2, Schelte wrote:
> That seems to me like there's a bug lurking somewhere.
Just guessing: the byte-code compiler treats regexp specially, and in this case it gets the switch handling wrong, i.e. it does not obey the Dodekalogue when looking for switches.
Not just guessing: if we override regexp with a proc (so the byte-code compiler no longer handles it inline), the issue is gone.
set str foo
eval {regexp \- $str} ;# 0
rename regexp tcl_regexp
proc regexp args {tailcall tcl_regexp {*}$args}
eval {regexp \- $str} ;# bad option "-"
Using -- fixes the issue, too.
eval {regexp -- \- $str} ;# 0
eval {regexp -- - $str} ;# 0
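To make the workaround reusable, here is a minimal sketch of a wrapper that always passes --, so a pattern starting with "-" can never be taken for a switch. The name safe_match is made up for illustration; it is not part of Tcl.

```tcl
# Hypothetical helper (not part of Tcl): always terminate the switches
# with "--" so that a pattern beginning with "-" is treated as a pattern.
proc safe_match {pattern str} {
    return [regexp -- $pattern $str]
}

puts [safe_match - foo]   ;# 0, "foo" contains no "-"
puts [safe_match - a-b]   ;# 1
```

This costs nothing and behaves the same whether or not the call gets byte-compiled.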
Back to the original command. Still with -- in place, the byte-code (obviously?) differs from Schelte's ...
% set tcl_patchLevel
8.6.4
% ::tcl::unsupported::disassemble script {regexp -- \- $str}
ByteCode 0x0x23c31d0, refCt 1, epoch 15, interp 0x0x22fc680 (epoch 15)
Source "regexp -- \- $str"
Cmds 1, src 17, inst 8, litObjs 2, aux 0, stkDepth 2, code/src 0.00
Commands 1:
1: pc 0-6, src 0-16
Command 1: "regexp -- \- $str"
(0) push1 0 # "-"
(2) push1 1 # "str"
(4) loadStk
(5) regexp +3
(7) done
% ::tcl::unsupported::disassemble script {regexp -- - $str}
ByteCode 0x0x23c33d0, refCt 1, epoch 15, interp 0x0x22fc680 (epoch 15)
Source "regexp -- - $str"
Cmds 1, src 16, inst 8, litObjs 2, aux 0, stkDepth 2, code/src 0.00
Commands 1:
1: pc 0-6, src 0-15
Command 1: "regexp -- - $str"
(0) push1 0 # "*-*"
(2) push1 1 # "str"
(4) loadStk
(5) strmatch +0
(7) done
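Now, this is really interesting: for the trivial literal pattern "-", Tcl optimizes the regexp away and compiles a string match against the glob pattern "*-*" instead (the strmatch instruction above), whereas \- still gets a real regexp instruction.
And now, I can also understand the quoted byte-code better.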
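As a quick sanity check that the substitution is harmless, the following sketch compares both forms on a few sample strings (the sample values are my own choice). It only shows that the two commands agree, not what the compiler does internally.

```tcl
# Compare "regexp -- -" with the glob form "string match *-*".
foreach s {foo a-b - ""} {
    set viaRegexp [regexp -- - $s]
    set viaGlob   [string match *-* $s]
    if {$viaRegexp != $viaGlob} {
        error "results differ for \"$s\""
    }
    puts [format "%-6s regexp=%d glob=%d" "\"$s\"" $viaRegexp $viaGlob]
}
```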
> On 16/09/2022 21:06, Erik Leunissen wrote:
> > % regexp - $str; #1
> > bad option "-": must be -all, -about, -indices, -inline, -expanded,
> > -line, -linestop, -lineanchor, -nocase, -start, or --
> > % regexp \- $str; #2
> > 1
> While these should be the exact same thing, they produce different byte
> codes:
>
> % ::tcl::unsupported::disassemble script {regexp - $str}
> ByteCode 0x0x555be8c65680, refCt 1, epoch 17, interp 0x0x555be8bfa380
> (epoch 17)
> Source "regexp - $str"
> Cmds 1, src 13, inst 10, litObjs 3, aux 0, stkDepth 3, code/src 0.00
> Commands 1:
> 1: pc 0-8, src 0-12
> Command 1: "regexp - $str"
> (0) push1 0 # "regexp"
> (2) push1 1 # "-"
> (4) push1 2 # "str"
> (6) loadStk
> (7) invokeStk1 3
> (9) done
The byte-code compiler cannot optimize regexp with the invalid "switch" "-"; therefore, it simply invokes the actual command (invokeStk1), which will bail out at run time.
> % ::tcl::unsupported::disassemble script {regexp \- $str}
> ByteCode 0x0x555be8c66180, refCt 1, epoch 17, interp 0x0x555be8bfa380
> (epoch 17)
> Source "regexp \- $str"
> Cmds 1, src 14, inst 8, litObjs 2, aux 0, stkDepth 2, code/src 0.00
> Commands 1:
> 1: pc 0-6, src 0-13
> Command 1: "regexp \- $str"
> (0) push1 0 # "-"
> (2) push1 1 # "str"
> (4) loadStk
> (5) regexp +3
> (7) done
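Here the byte-code compiler seems not to detect the erroneous first argument. It apparently looks for switches before backslash substitution has been applied, so \- is not recognized as starting with "-". The substituted "-" is then pushed as the pattern and handed to the regexp instruction, which does no switch parsing of its own ... therefore, it produces the "correct" invocation of regexp instead of the error.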
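For anyone who wants to check their own Tcl version, here is a small sketch that runs the command once as a literal (eligible for inline compilation) and once through {*} expansion (which, as far as I understand, always goes through the regular command implementation). The variable names are mine, and I deliberately state no expected result for the first path, since it depends on the version.

```tcl
set str foo

# Path 1: literal command inside a braced script, may be byte-compiled inline.
# On 8.6.4 this returned 0 for me instead of raising an error.
set rc1 [catch {regexp \- $str} res1]

# Path 2: command built as a list and expanded; assumed to bypass the
# inline compilation and to reach the normal switch parsing.
set cmd [list regexp - $str]
set rc2 [catch {{*}$cmd} res2]

puts "compiled path: rc=$rc1 result=$res1"
puts "expanded path: rc=$rc2 result=$res2"
```

If the two lines differ, the interpreter shows the inconsistency discussed in this thread.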