Regular expression challenge


Fred H Olson

Apr 14, 2021, 6:36:34 PM
to Semware
I've been trying to develop one regular expression that will
parse 2 to 4 parameters from SAL style parameter strings ( "p-str"s ).
I've got something that almost works...

Below the proc parse3() (also in attached file)
does lfind() with this regular expression and
displays the results for the 3 possible formats of p-str ( 2, 3 and 4
parameters) one line at a time. After each, it displays the p-str and
the parsed parameters delimited by vertical bars.
An empty parameter is displayed as "||". parse3() is assigned to <f5>.

It seems to do what I want EXCEPT for the 3 parameter case, which
shows the 3rd parameter as empty and reports its value as the 4th
parameter instead! Curiously, the 4 parameter case displays what I expect.

It seems like the problem might be related to the numbering of
sub-patterns so I also display the 6 sub-patterns the way I think they
are numbered which is also shown in comments above the regular
expression below.

Regular expressions can be difficult to understand.
For reference, at the end of this file I've quoted part of the TSE help
for the ? and {} notation used here.

Can anyone explain the problem with the 3 parameter case?

Fred <fho...@cohousing.org>

START:: 3 possible formats of p-str :
2 p-str: ("abc","ig") :
3 p-str: ("abc","ig","n") :
4 p-str: ("abc","ig","x","300") :

*/

string p1[15]
string p2[15]
string p3[15]
string p4[15]
string f1[15]
string f2[15]
string f3[15]
string f4[15]
string f5[15]
string f6[15]
string parms[50]

proc parse_one_line()
    parms=gettext(1,40)
    begline()
    //       -p1-   -p2-    -p3-      -p4-    < parm numbers
    //       1      2    3  4      5  6       < sub-pattern numbers
    lfind('("{.*}","{.*}"{,"{.*}"}?{,"{.*}"}?)','xc')
    p1=GetFoundText(1)
    p2=GetFoundText(2)
    p3=GetFoundText(4)
    p4=GetFoundText(6)
    warn(parms , ' p1=|',p1,'| ','p2=|',p2,'| ','p3=|',p3,'| ','p4=|',p4,'|')

    f1=GetFoundText(1)
    f2=GetFoundText(2)
    f3=GetFoundText(3)
    f4=GetFoundText(4)
    f5=GetFoundText(5)
    f6=GetFoundText(6)
    warn(parms , ' f1=|',f1,'| f2=|',f2,'| f3=|',f3,'| f4=|',f4,'| f5=|',f5,'| f6=|',f6,'|')
end

proc parse3()
    if lfind("START::","g")
        down() parse_one_line()
        down() parse_one_line()
        down() parse_one_line()
    else
        Warn("Could not find lines to parse")
    endif
end

<f5> parse3()
/* excerpt from TSE help:

? In a search pattern, optionally matches the preceding sub-pattern.

Example:

Search pattern: colou?r

matches the strings color or colour

{ } In a search pattern, serves as a Tag to identify a sub-pattern within
the full search pattern. Tagged patterns can be nested.

Tags are used to define a group of characters as a sub-pattern so
that an operator acts on more than one character or character Class.

Tags are also used to identify a sub-pattern within a Regular
Expression so the sub-pattern can be separately referenced in a
subsequent replacement. Tagged sub-patterns are implicitly numbered
from 1 through 9 based on the leftmost "{" symbol. The sub-pattern
number can be used within a replacement string to reference a tagged
sub-pattern, using the following format:

\n

where "n" is the actual sub-pattern number from 1 - 9 that
represents the appropriate tagged sub-pattern. To identify the FULL
search pattern, "n" is "0" (that is, \0).
*/
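
The numbering rule quoted above (tags are numbered by their leftmost "{") can be checked mechanically against the pattern in parse_one_line(). The following is a minimal Python sketch, not SAL: tag_spans is a hypothetical helper written for this illustration, and it assumes the pattern contains no braces inside character classes or escapes, which holds for the pattern in question.

```python
def tag_spans(pattern):
    """Map tag number -> sub-pattern text, numbering tags by their opening '{'."""
    tags = {}
    open_stack = []   # (tag number, start index) of tags that are not closed yet
    next_number = 1
    for i, ch in enumerate(pattern):
        if ch == '{':
            open_stack.append((next_number, i))
            next_number += 1
        elif ch == '}':
            number, start = open_stack.pop()
            tags[number] = pattern[start:i + 1]
    return dict(sorted(tags.items()))

# The search pattern from parse_one_line(), copied verbatim.
FRED_PATTERN = '("{.*}","{.*}"{,"{.*}"}?{,"{.*}"}?)'

tags = tag_spans(FRED_PATTERN)
for number, text in tags.items():
    print(number, text)
```

Under this rule, tags 3 and 5 are the optional wrappers and tags 4 and 6 are the parameters inside them, which agrees with the numbering shown in the macro's comments. The numbering itself is therefore unlikely to be the cause of the problem.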


--
Fred H. Olson Minneapolis,MN 55411 USA (near north Mpls)
Email: fholson at cohousing.org 612-588-9532
My Link Pg: http://fholson.cohousing.org
parse-test.s

Carlo Hogeveen

Apr 14, 2021, 9:42:14 PM
to sem...@googlegroups.com

Fred,

If I may be so bold, that is an excellent example/test macro.

The reason your example "fails" for the given three-parameter string is that TSE's regular expression "?" operator is non-greedy.
In the terms TSE's documentation uses for other regular expression characters: "?" matches "with minimal closure".
Both mean that {...}? tries to match the empty string first.
For instance, if "Warn" occurs in a text and you search for "Warn?", you will find "War" (highlighted).
Likewise, if "Warning" occurs in a text and you search for "Warn?ing", it will only find "Warning" if "Waring" does not occur first.
So when your lFind() searches for {...}? twice, the regular expression has no problem matching the first {...}? with an empty string for your three-parameter string, so it does. The second {...}? then has to match the third parameter, which is why it shows up as the 4th.
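
The effect described above can be reproduced outside TSE. The sketch below uses Python's re module, which is a different engine: "(...)" stands in for TSE's "{...}", Python's lazy "??" is assumed to behave like TSE's "?" as described here, and the fields use [^"]* rather than .* to keep the match unambiguous. The names FIELD, GREEDY, LAZY and params are made up for this illustration.

```python
import re

FIELD = r'"([^"]*)"'   # one quoted parameter, captured without its quotes

# Groups 1 and 2 are the required parameters; groups 3 and 5 are the optional
# wrappers, and groups 4 and 6 are the parameters inside them.
GREEDY = re.compile(r'\(' + FIELD + ',' + FIELD + '(,' + FIELD + ')?(,' + FIELD + r')?\)')
LAZY = re.compile(r'\(' + FIELD + ',' + FIELD + '(,' + FIELD + ')??(,' + FIELD + r')??\)')

def params(regex, text):
    """Return [p1, p2, p3, p4], with '' for a parameter that did not match."""
    m = regex.fullmatch(text)
    return [m.group(n) or '' for n in (1, 2, 4, 6)]

for line in ('("abc","ig")', '("abc","ig","n")', '("abc","ig","x","300")'):
    print(line, 'lazy:', params(LAZY, line), 'greedy:', params(GREEDY, line))
```

With the lazy operator, the three-parameter line yields an empty p3 and "n" as p4, while the two- and four-parameter lines come out as expected. That is the same symptom the macro shows.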

I did find a solution to your specific challenge that, for each of your example strings, returns the Nth parameter with GetFoundText(N + 1).
This solution uses an undocumented hack: if a sequence of patterns before the regular expression character "|" only partially matches, and the expression after the "|" succeeds, then the sub-patterns that did match are still assigned to GetFoundText(<pattern number>).
So a solution to your specific challenge is:
lFind('{("{[~"]*}","{[~"]*}","{[~"]*}","{[~"]*}"}|{"}','xc')
The last double quote can be replaced by any character or string that will succeed for your example parameter strings.

Carlo



Fred H Olson

Apr 15, 2021, 1:06:56 AM
to sem...@googlegroups.com
Wow Carlo, I sort of understand your explanation, and your solution
works fine. See a couple of inline comments below.

On Thu, 15 Apr 2021, Carlo Hogeveen wrote:

> Fred,
>
> If I may be so bold, that is an excellent example/test macro.

Glad you liked it.
By trial and error I found that the final {"} alternative can be "," (comma) but not ")", which is fine.

Of course I was still not quite satisfied. I wanted the delimiter in the
p-str to be either a single or a double quote. So I put it in a single-character
string variable:
string d[1]
where d will be set to ' or " as appropriate for the p-str,
and the lfind becomes:

lFind('{('+d+'{[~'+d+']*}'+d+','+d+'{[~'+d+']*}'+
d+','+d+'{[~'+d+']*}'+d+','+d+'{[~'+d+']*}'+d+'}|{,}','xc')
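
The concatenation above is easy to get wrong by one quote, so it can help to build the same pattern text from a single repeated piece and compare it with the hand-written version. This is a minimal Python sketch of that check only, not SAL and not a matcher: build_tse_pattern is a hypothetical helper name, and the resulting string is meant to be read as a TSE search pattern.

```python
def build_tse_pattern(d):
    """Build the four-field TSE search pattern for delimiter d (' or ")."""
    field = d + '{[~' + d + ']*}' + d          # one delimited, tagged field
    return '{(' + ','.join([field] * 4) + '}|{,}'

double_quoted = build_tse_pattern('"')
single_quoted = build_tse_pattern("'")
print(double_quoted)
print(single_quoted)
```

For d set to a double quote this yields the pattern from the earlier reply with its final {"} replaced by {,}. Because the outer tag opens first, it is tag 1 and the four fields are tags 2 to 5, which matches the GetFoundText(N + 1) rule.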

Sure glad that, once working, these regular expression search strings
don't have to be thought about again.

Thanks a lot Carlo.

Fred