From the answer to my own posting ca. 2020:
Subject: Re: Why does adding \s* to my RE change non-greedy to greedy?
On 4/13/2020 5:35 PM, heinrichmartin wrote:
> On Monday, April 13, 2020 at 11:17:10 PM UTC+2, Dave wrote:
>> Tcl 8.6.8, win7x64
>>
>> Adding \s* to my RE changes .*? from non-greedy to greedy
>>
>> Test script:
>>
>> proc Test {re text} {
>> puts "\nRe: \"$re\""
>> puts "Matching against \"$text\""
>>
>> set n 0
>> foreach {match sub1 sub2} [regexp -all -inline -indices $re $text] {
>> lassign $match s e
>> puts "Match [incr n]: [string range $text $s $e]"
>> lassign $sub1 s e
>> puts " $n.1: [string range $text $s $e]"
>> lassign $sub2 s e
>> puts " $n.2: [string range $text $s $e]"
>> }
>> }
>>
>> set string "...<i>111</i>..<i>22</i>.."
>>
>> Test {<([ib])>(.*?)</\1>} $string
>>
>> Test {<([ib])\s*>(.*?)</\1\s*>} $string
>>
>> Output:
>>
>> Re: "<([ib])>(.*?)</\1>"
>> Matching against "...<i>111</i>..<i>22</i>.."
>> Match 1: <i>111</i>
>> 1.1: i
>> 1.2: 111
>> Match 2: <i>22</i>
>> 2.1: i
>> 2.2: 22
>>
>> Re: "<([ib])\s*>(.*?)</\1\s*>"
>> Matching against "...<i>111</i>..<i>22</i>.."
>> Match 1: <i>111</i>..<i>22</i>
>> 1.1: i
>> 1.2: 111</i>..<i>22
>> (Temp) 1 %
>>
>> The "(.*?)" is no longer non-greedy. Why?
>
> First preference wins, see
> https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm#M95 with regard to
> greedy vs non-greedy preference.
>
> Two more remarks:
> * \s* is followed by non-whitespace ">", make it non-greedy.
> * Using regexp to parse XML/HTML is not a good idea. Use e.g. tdom.
>
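Thank you. I had skimmed past that part because I thought that the (.*?)
was sufficient. My RE is now {<{1,1}?([ib])\s*>(.*?)</\1\s*>} and it is
working fine: the leading non-greedy {1,1}? is now the first quantifier,
so the whole RE prefers the shortest match.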
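
For anyone who wants to compare the variants side by side, here is a
minimal sketch. The variable names are mine, not from the thread. The
third pattern is my reading of heinrichmartin's "make it non-greedy"
remark and was not posted in the thread, so treat it as an assumption.

```tcl
set text "...<i>111</i>..<i>22</i>.."

# Original problem RE: the greedy \s* is the first quantifier,
# so the RE as a whole prefers the longest match.
set greedy    {<([ib])\s*>(.*?)</\1\s*>}

# Workaround from this post: a dummy non-greedy {1,1}? comes first,
# so the RE as a whole prefers the shortest match.
set lazyFirst {<{1,1}?([ib])\s*>(.*?)</\1\s*>}

# Assumed alternative: make every quantifier non-greedy.
set lazyAll   {<([ib])\s*?>(.*?)</\1\s*?>}

foreach re [list $greedy $lazyFirst $lazyAll] {
    # -inline returns match, sub1, sub2 for each match found.
    set result [regexp -all -inline $re $text]
    puts "$re"
    puts "  matches: [expr {[llength $result] / 3}]  -> $result"
}
```

The first pattern should report a single match spanning both <i> elements,
as in the output quoted above. The other two should report two separate
matches.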
--
computerjock AT mail DOT com