possible bug in "regexp" syntax ".*?"

83 views
Skip to first unread message

aotto1968

unread,
Jun 26, 2022, 3:05:20 PMJun 26
to
Hi,

my code

# @cartouche
while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end
smallexample[^@]+@end cartouche} $txt allI valI]} {
set val [string trim [string range $txt {*}$valI]]
set val [string map { \\@\{ @\{ } $val ]
set txt [string replace $txt {*}$allI "~~~~~\n$val\n~~~~~" ]
}

with data from:

@cartouche
@smallexample
# file: quote.cfg
quote = "Criticism may not be agreeable, but it is necessary."
" It fulfils the same function as pain in the human"
" body. It calls attention to an unhealthy state of"
" things.\n"
"\t--Winston Churchill";
@end smallexample
@end cartouche

@cartouche
@smallexample
# file: test.cfg
info: @{
name = "Winston Churchill";
@@include "quote.cfg"
country = "UK";
@};
@end smallexample
@end cartouche

create the *false* match:

~~~~~
# file: quote.cfg
quote = "Criticism may not be agreeable, but it is necessary."
" It fulfils the same function as pain in the human"
" body. It calls attention to an unhealthy state of"
" things.\n"
"\t--Winston Churchill";
@end smallexample
@end cartouche

@cartouche
@smallexample
# file: test.cfg
info: @{
name = "Winston Churchill";
@@include "quote.cfg"
country = "UK";
@};
~~~~~

because of error in ".*?" → should find the *smalles* match.
the *good* solution would be:

~~~~~
# file: quote.cfg
quote = "Criticism may not be agreeable, but it is necessary."
" It fulfils the same function as pain in the human"
" body. It calls attention to an unhealthy state of"
" things.\n"
"\t--Winston Churchill";
~~~~~

~~~~~
# file: test.cfg
info: @{
name = "Winston Churchill";
@@include "quote.cfg"
country = "UK";
@};
~~~~~

example online: https://regex101.com/r/AWkpiL/1

mfg

aotto1968

unread,
Jun 26, 2022, 3:19:36 PMJun 26
to
Just as info, the following code do the job (split-by-hand)

# @cartouche
while {[regexp -indices {@cartouche[^@]+@smallexample} $txt allI]} {
lassign $allI start1I start2I
if {![regexp -indices -start $start2I
{@end\s+smallexample[^@]+@end\s+cartouche} $txt allI]} {
error "don't find end of '@cartouche'"
}
lassign $allI end1I end2I
set val [string trim [string range $txt $start2I+1 $end1I-1]]
set val [string map { \\@\{ @\{ } $val ]
set txt [string replace $txt $start1I $end2I "~~~~~\n$val\n~~~~~" ]
}

mfg

Christian Gollwitzer

unread,
Jun 26, 2022, 3:40:27 PMJun 26
to
Am 26.06.22 um 21:05 schrieb aotto1968:
> Hi,
>
> my code
>
>   # @cartouche
>   while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end
> smallexample[^@]+@end cartouche} $txt allI valI]} {

I haven't checked the details, but your RE mixes greedy (+) and
non-greedy (.*?) quantifiers. AFAIK, this can lead to hard to understand
behaviour. Quote from the manpage:
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.html
"The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner, but do not work as hard at
guessing the user's real intentions.) "

The page you used to test the RE uses PCRE, the most widely known RE
engine, as opposed to Henry Spencer's RE engine as used by Tcl. In
certain corner cases (esp. for maliciously crafted REs), the algorithm
used by PCRE can be exponentially slow whereas Spencer's engine is
polynomial, however, on average for non-malicious REs PCRE is faster.

One could argue that Tcl 9 should migrate to PCRE and drop Spencer's RE,
which is also badly maintained. Someone with more moment than me could
write a TIP about it ;)

Christian

heinrichmartin

unread,
Jun 27, 2022, 1:55:10 AMJun 27
to
On Sunday, June 26, 2022 at 9:40:27 PM UTC+2, Christian Gollwitzer wrote:
> Am 26.06.22 um 21:05 schrieb aotto1968:
> > Hi,
> >
> > my code
> >
> > # @cartouche
> > while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end
> > smallexample[^@]+@end cartouche} $txt allI valI]} {
> I haven't checked the details, but your RE mixes greedy (+) and
> non-greedy (.*?) quantifiers. AFAIK, this can lead to hard to understand
> behaviour. Quote from the manpage:
> https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.html
> "The matching rules for REs containing both normal and non-greedy
> quantifiers have changed since early beta-test versions of this package.
> (The new rules are much simpler and cleaner, but do not work as hard at
> guessing the user's real intentions.) "

You are looking for the description of the _preference_ of a RE. Take-away message: you cannot mix greedy and non-greedy matching; and the *first* quantifier determines the preference of the full RE.

I often fall back to _not_ create the super-duper long regexp that matches the whole thing from a stream (I am using Expect a lot), but match the header first (anchored at the start of the buffer, discarding junk), then find the matching footer; i.e. when magic[1] does not help, go the extra mile and implement that state-machine.

Having that said, you could also look into packages that facilitate parsing (lexers). There used to be fickle/fcl (that was used by tcldoc; I am writing in past-tense, because a quick web search did not show their current homes).

[1] https://xkcd.com/208/

Rich

unread,
Jun 27, 2022, 9:49:50 AMJun 27
to
heinrichmartin <martin....@frequentis.com> wrote:
> Having that said, you could also look into packages that facilitate
> parsing (lexers). There used to be fickle/fcl (that was used by
> tcldoc; I am writing in past-tense, because a quick web search did
> not show their current homes).
>
> [1] https://xkcd.com/208/

Tcllib's PEG parser generator is relatively easy to use, once one
grasps the syntax.

http://tmml.sourceforge.net/doc/tcllib/peg.html

There was also a nice example posted here to the group some months back
for a PEG grammar. I forget now who posted it.

Oleg Nemanov

unread,
Jul 11, 2022, 8:44:11 AMJul 11
to
воскресенье, 26 июня 2022 г. в 22:40:27 UTC+3, Christian Gollwitzer:
> One could argue that Tcl 9 should migrate to PCRE and drop Spencer's RE,
> which is also badly maintained. Someone with more moment than me could
> write a TIP about it ;)

No-no. Let's stay RE engine as is :-). If someone want to use PCRE he can do this
with external lib even now.

Reply all
Reply to author
Forward
0 new messages