
An RE mystery


Helmut Giese

Sep 4, 2021, 4:20:00 PM
Hello out there,
in reply to my post from yesterday, 'ISO conversion tool for text
widgets' Dave posted code which contains an intriguing RE. After some
head scratching I finally understand it - almost. The remaining puzzle
is: What is the difference in the quantifiers '{1,1}?' and '{1}?' ?

The following code demonstrates what I mean:
---
set txt "This is normal text while this is <i>italic</i>
and this is <i>too</i>."

set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}
set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}

set ranges [regexp -all -indices -inline $re1 $txt]
puts "Version re1"
puts $ranges
puts ""
set ranges [regexp -all -indices -inline $re2 $txt]
puts "Version re2"
puts $ranges
---
The two versions give different results. Why is this? As per the man page: Isn't
'a sequence of exactly 1 match of the atom'
the same as
'a sequence of 1 to 1 (inclusive) matches of the atom'?

Any enlightenment will be greatly appreciated.
Helmut

Dave

Sep 4, 2021, 5:25:20 PM
From the answer to my own posting ca. 2020:

Subject: Re: Why does adding \s* to my RE change non-greedy to greedy?

On 4/13/2020 5:35 PM, heinrichmartin wrote:
> On Monday, April 13, 2020 at 11:17:10 PM UTC+2, Dave wrote:
>> Tcl 8.6.8, win7x64
>>
>> Adding \s* to my RE changes .*? from non-greedy to greedy
>>
>> Test script:
>>
>> proc Test {re text} {
>> puts "\nRe: \"$re\""
>> puts "Matching against \"$text\""
>>
>> set n 0
>> foreach {match sub1 sub2} [regexp -all -inline -indices $re $text] {
>> lassign $match s e
>> puts "Match [incr n]: [string range $text $s $e]"
>> lassign $sub1 s e
>> puts " $n.1: [string range $text $s $e]"
>> lassign $sub2 s e
>> puts " $n.2: [string range $text $s $e]"
>> }
>> }
>>
>> set string "...<i>111</i>..<i>22</i>.."
>>
>> Test {<([ib])>(.*?)</\1>} $string
>>
>> Test {<([ib])\s*>(.*?)</\1\s*>} $string
>>
>> Output:
>>
>> Re: "<([ib])>(.*?)</\1>"
>> Matching against "...<i>111</i>..<i>22</i>.."
>> Match 1: <i>111</i>
>> 1.1: i
>> 1.2: 111
>> Match 2: <i>22</i>
>> 2.1: i
>> 2.2: 22
>>
>> Re: "<([ib])\s*>(.*?)</\1\s*>"
>> Matching against "...<i>111</i>..<i>22</i>.."
>> Match 1: <i>111</i>..<i>22</i>
>> 1.1: i
>> 1.2: 111</i>..<i>22
>> (Temp) 1 %
>>
>> The "(.*?)" is no longer non-greedy. Why?
>
> First preference wins, see
> https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm#M95 with regard to
> greedy vs non-greedy preference.
>
> Two more remarks:
> * \s* is followed by non-whitespace ">", make it non-greedy.
> * Using regexp to parse XML/HTML is not a good idea. Use e.g. tdom.
>

Thank you. I had skimmed past that part because I thought that the (.*?)
was sufficient. My re is now {<{1,1}?([ib])\s*>(.*?)</\1\s*>} and it is
working fine.
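
heinrichmartin's first remark points to another route: if \s* is itself made non-greedy, the first quantifier with a preference is non-greedy and the {1,1}? prefix should not be needed. A minimal sketch under that reading, reusing the test string from the quoted post (the variable names are only for illustration):

```tcl
set text "...<i>111</i>..<i>22</i>.."

# Every quantifier here is non-greedy, so the first one that has a
# preference (\s*?) makes the RE as a whole prefer the shortest match.
set re {<([ib])\s*?>(.*?)</\1\s*?>}

# -inline returns the full match plus the two submatches for each hit.
set matches [regexp -all -inline $re $text]

set n 0
foreach {match tag body} $matches {
    puts "Match [incr n]: $match (tag=$tag, body=$body)"
}
```

This keeps the RE free of the dummy quantifier on "<", at the cost of having to remember the ? on every later quantifier.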


--
computerjock AT mail DOT com

Helmut Giese

Sep 5, 2021, 1:41:20 PM
Hi Dave,
well, your post didn't really answer my original question
why is {1}? != {1,1}?
But since the URL you cited also mentions only {1,1}? to
'make a RE non-greedy overall'
and never talks about {1}? I think I'll leave it at that.
Thanks for the follow-up.
Helmut
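
For what it is worth, the MATCHING section of the re_syntax page cited above seems to cover this case: a quantified atom with {m} or {m}? has the same preference (possibly none) as the atom itself, whereas {m,n}? prefers the shortest match even when m equals n. On that reading, <{1}? leaves the RE's preference to the next quantifier (the greedy \s*), while <{1,1}? sets it to non-greedy. A minimal sketch to check this, reusing the text and REs from the first post (the helper name fullMatches is made up for illustration):

```tcl
# Text from the original post, with the line break written as \n.
set txt "This is normal text while this is <i>italic</i>\nand this is <i>too</i>."

set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}
set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}

# Hypothetical helper: return only the full-match substrings found by $re.
proc fullMatches {re text} {
    set result {}
    # -inline yields the full match plus two submatches per hit.
    foreach {match tag body} [regexp -all -inline $re $text] {
        lappend result $match
    }
    return $result
}

set m1 [fullMatches $re1 $txt]
set m2 [fullMatches $re2 $txt]
puts "re1: [llength $m1] match(es): $m1"
puts "re2: [llength $m2] match(es): $m2"
```

Going by the behaviour reported in this thread, re1 should find the two italic spans separately, while re2 should return a single match running from the first <i> to the last </i>.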

Conor Williams

Oct 11, 2021, 3:42:15 PM
cool.

Conor Williams

Oct 11, 2021, 7:46:27 PM
i usually use a flavour of vi for my regexes (maybe sed the odd time either...)

but.. (in tcl) {4,5} on a dot means match at least 4 and at most 5
and {4} following a dot means match 4

/c:202111102339:23

2 cases:

1@
txt:
<b>a</b>e<b>bbb</b>f<b>cccc</b>g<b>ddddd</b>h<b>aaaaaaaa</b>

regex:
<b>.{4,5}</b>

yields:
<b>cccc</b> <b>ddddd</b>

2@
regex:
<b>.{4}</b>

yields:
<b>cccc</b>
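
The two cases can be reproduced in tclsh with something like the following sketch (the variable names case1 and case2 are only for illustration):

```tcl
set txt {<b>a</b>e<b>bbb</b>f<b>cccc</b>g<b>ddddd</b>h<b>aaaaaaaa</b>}

# Case 1: between 4 and 5 characters inside the <b> element.
set case1 [regexp -all -inline {<b>.{4,5}</b>} $txt]

# Case 2: exactly 4 characters inside the <b> element.
set case2 [regexp -all -inline {<b>.{4}</b>} $txt]

puts "case 1: $case1"
puts "case 2: $case2"
```

Note that the bounds only restrict how many characters the dot may consume; they do not touch the greedy vs non-greedy question from earlier in the thread.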