An RE mystery

Helmut Giese

Sep 4, 2021, 4:20:00 PM
Hello out there,
in reply to my post from yesterday, 'ISO conversion tool for text
widgets' Dave posted code which contains an intriguing RE. After some
head scratching I finally understand it - almost. The remaining puzzle
is: What is the difference in the quantifiers '{1,1}?' and '{1}?' ?

The following code demonstrates what I mean:
---
set txt "This is normal text while this is <i>italic</i>
and this is <i>too</i>."

set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}
set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}

set ranges [regexp -all -indices -inline $re1 $txt]
puts "Version re1"
puts $ranges
puts ""
set ranges [regexp -all -indices -inline $re2 $txt]
puts "Version re2"
puts $ranges
---
The two versions return different ranges. Why is this? As per the man page: Isn't
'a sequence of exactly 1 match of the atom'
the same as
'a sequence of 1 to 1 (inclusive) matches of the atom'?

Any enlightenment will be greatly appreciated.
Helmut

Dave

Sep 4, 2021, 5:25:20 PM
From the answer to my own posting ca. 2020:

Subject: Re: Why does adding \s* to my RE change non-greedy to greedy?

On 4/13/2020 5:35 PM, heinrichmartin wrote:
> On Monday, April 13, 2020 at 11:17:10 PM UTC+2, Dave wrote:
>> Tcl 8.6.8, win7x64
>>
>> Adding \s* to my RE changes .*? from non-greedy to greedy
>>
>> Test script:
>>
>> proc Test {re text} {
>>     puts "\nRe: \"$re\""
>>     puts "Matching against \"$text\""
>>
>>     set n 0
>>     foreach {match sub1 sub2} [regexp -all -inline -indices $re $text] {
>>         lassign $match s e
>>         puts "Match [incr n]: [string range $text $s $e]"
>>         lassign $sub1 s e
>>         puts " $n.1: [string range $text $s $e]"
>>         lassign $sub2 s e
>>         puts " $n.2: [string range $text $s $e]"
>>     }
>> }
>>
>> set string "...<i>111</i>..<i>22</i>.."
>>
>> Test {<([ib])>(.*?)</\1>} $string
>>
>> Test {<([ib])\s*>(.*?)</\1\s*>} $string
>>
>> Output:
>>
>> Re: "<([ib])>(.*?)</\1>"
>> Matching against "...<i>111</i>..<i>22</i>.."
>> Match 1: <i>111</i>
>> 1.1: i
>> 1.2: 111
>> Match 2: <i>22</i>
>> 2.1: i
>> 2.2: 22
>>
>> Re: "<([ib])\s*>(.*?)</\1\s*>"
>> Matching against "...<i>111</i>..<i>22</i>.."
>> Match 1: <i>111</i>..<i>22</i>
>> 1.1: i
>> 1.2: 111</i>..<i>22
>> (Temp) 1 %
>>
>> The "(.*?)" is no longer non-greedy. Why?
>
> First preference wins, see
> https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm#M95 with regard to
> greedy vs non-greedy preference.
>
> Two more remarks:
> * \s* is followed by non-whitespace ">", make it non-greedy.
> * Using regexp to parse XML/HTML is not a good idea. Use e.g. tdom.
>

Thank you. I had skimmed past that part because I thought that the (.*?)
was sufficient. My re is now {<{1,1}?([ib])\s*>(.*?)</\1\s*>} and it is
working fine.
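
A minimal sketch of that fix, reusing the test string from the quoted post. The variable names text, greedyRe and lazyRe are made up for illustration. It relies on the preference rule from re_syntax cited above: the first quantified atom that has a preference decides whether the RE as a whole goes for the longest or the shortest match.

```tcl
# Test string from the quoted post.
set text "...<i>111</i>..<i>22</i>.."

# The first quantified atom with a preference is the greedy \s*,
# so the RE as a whole prefers the longest match.
set greedyRe {<([ib])\s*>(.*?)</\1\s*>}

# Putting {1,1}? on the leading "<" makes a non-greedy atom come first,
# so the RE as a whole prefers the shortest match.
set lazyRe {<{1,1}?([ib])\s*>(.*?)</\1\s*>}

foreach re [list $greedyRe $lazyRe] {
    puts "Re: $re"
    # Without -indices, regexp returns the matched text directly.
    foreach {match tag body} [regexp -all -inline $re $text] {
        puts "  match: $match  tag: $tag  body: $body"
    }
}
```

The first RE should report a single match spanning both tags, the second one match per tag.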


--
computerjock AT mail DOT com

Helmut Giese

Sep 5, 2021, 1:41:20 PM
Hi Dave,
well, your post didn't really answer my original question
why is {1}? != {1,1}?
But since the URL you cited also mentions only {1,1}? to
'make a RE non-greedy overall'
and never talks about {1}? I think I'll leave it at that.
Thanks for the follow-up.
Helmut
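
For what it is worth, the difference is easier to see when the matched text is printed instead of the index ranges. The sketch below is an illustration, not an authoritative explanation; the helper name showMatches is made up, and the text is joined onto one line for brevity. One plausible reading, stated here as an assumption: the engine lets a fixed count {m} pass its operand's preference through, so '<{1}?' carries no preference and the greedy \s* decides, whereas a range {m,n}? sets a non-greedy preference even when m equals n. The PostgreSQL manual, which documents a descendant of the same regex package, describes fixed-repetition quantifiers this way.

```tcl
set txt "This is normal text while this is <i>italic</i> and this is <i>too</i>."

set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}
set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}

# Hypothetical helper: print the full text of every match of re in txt.
proc showMatches {label re txt} {
    puts $label
    foreach {match tag body} [regexp -all -inline $re $txt] {
        puts "  $match"
    }
}

showMatches "Version re1, range quantifier {1,1}?" $re1 $txt
showMatches "Version re2, fixed quantifier {1}?" $re2 $txt
```

If the assumption holds, re1 lists each tagged span separately while re2 lists one span running from the first opening tag to the last closing tag.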

Conor Williams

Oct 11, 2021, 3:42:15 PM
cool.

Conor Williams

Oct 11, 2021, 7:46:27 PM
i usually use a flavour of vi for my regexes (maybe sed the odd time either...)

but.. (in tcl) {4,5} on a dot means match at least 4 and at most 5
and {4} following a dot means match 4

/c:202111102339:23

2 cases: 1@
txt:
<b>a</b>e<b>bbb</b>f<b>cccc</b>g<b>ddddd</b>h<b>aaaaaaaa</b>

regex:
<b>.{4,5}</b>

yields:
<b>cccc</b> <b>ddddd</b>
:2@
regex
<b>.{4}</b>
yields:
<b>cccc</b>
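
The two cases above can be reproduced in tclsh with a short sketch like this one. The variable names rangeHits and exactHits are made up for illustration.

```tcl
set txt {<b>a</b>e<b>bbb</b>f<b>cccc</b>g<b>ddddd</b>h<b>aaaaaaaa</b>}

# Case 1: between 4 and 5 characters inside the tag.
set rangeHits [regexp -all -inline {<b>.{4,5}</b>} $txt]
puts $rangeHits

# Case 2: exactly 4 characters inside the tag.
set exactHits [regexp -all -inline {<b>.{4}</b>} $txt]
puts $exactHits
```

Note that these are plain greedy bounds with no trailing ?, so they do not touch the {1}? versus {1,1}? preference question.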