Question: parsing nested records

charlemagn...@gmail.com

Nov 7, 2016, 10:30:42 AM

Hi, I'm not sure how to solve this problem in GNU awk.

Given a data file with two lines:

{{ .. 1 .. }}
{{ .. 2 {{ .. 3 .. }} .. }} {{ .. 4 .. }}

The records start with {{ and end with }}. Sometimes there is more than 1 record in a line (2 and 4) and sometimes there is a record nested inside another (2 and 3).

What would be a generalized solution to printing the records (order doesn't matter):

{{ .. 1 .. }}
{{ .. 2 .. }}
{{ .. 3 .. }}
{{ .. 4 .. }}

A solution shouldn't assume the data between {{}} is what is shown here, it could be anything. Thanks for any help or suggestions.

Marc de Bourget

Nov 7, 2016, 11:20:01 AM

Maybe you should use the match function. What have you tried so far?

charlemagn...@gmail.com

Nov 7, 2016, 11:27:15 AM

> Maybe you should use the match function. What have you tried so far?

match() takes a regex, e.g. /{{[^}]*}}/. GNU awk's matching is greedy: /{{.*}}/ runs to the last }} on the line, and /{{[^}]*}}/ starts at the outer {{ and stops at the first }}, so neither can pick out an embedded record.
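A quick way to see what each candidate pattern matches first is sketched below. The wrapper name show_match and the output labels are made up for this illustration, and the braces are written as bracket expressions ([{] and [}]) on the assumption that this keeps awks with interval-expression support from reading them as repetition counts.

```sh
# show_match is a hypothetical helper: for each input line, print the
# first match of three candidate patterns.
show_match() {
    awk '{
        # Greedy dot-star: runs from the first {{ to the last }} on the line.
        if (match($0, /[{][{].*[}][}]/))
            print "dot-star: " substr($0, RSTART, RLENGTH)
        # Excluding only "}": starts at the outer {{, stops at the first }}.
        if (match($0, /[{][{][^}]*[}][}]/))
            print "no-close: " substr($0, RSTART, RLENGTH)
        # Excluding both braces: can only match an innermost record.
        if (match($0, /[{][{][^{}]+[}][}]/))
            print "no-brace: " substr($0, RSTART, RLENGTH)
    }' "$@"
}
```

Feeding it a line such as {{a{{b}}c}}{{d}} shows that only the pattern excluding both braces isolates a complete record.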

Kenny McCormack

Nov 7, 2016, 11:32:17 AM

In article <e2ad06e4-b3fc-4ed4...@googlegroups.com>,

I'm getting pretty close with:

% gawk4 '{ while (match($0,/{{[^{}]+}}/)) { print substr($0,RSTART,RLENGTH);$0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) }}' yourfile
{{ .. 1 .. }}
{{ .. 3 .. }}
{{ .. 2 .. }}
{{ .. 4 .. }}
%

Yes, I realize it isn't perfect, but it should get you started.

--

Prayer has no place in the public schools, just like facts
have no place in organized religion.
-- Superintendent Chalmers

charlemagn...@gmail.com

Nov 7, 2016, 12:29:46 PM

Thanks. Interesting.

The solution works for that data set, but it doesn't seem to work for me when the data is changed a little:

{{ .. 8 {{ .. 9 .. }} .. }} {{ .. 10 {{ .. 11 .. }} .. }} {{ .. 12 .. }}

Unfortunately I have no idea how many records there will be or if they will be single records (like 12) or nested (like 8 and 9).

Kenny McCormack

Nov 7, 2016, 1:01:35 PM

In article <a6414292-c0e4-45cc...@googlegroups.com>,

It would be a good thing if you would quote (i.e., learn how to quote) so
that we could tell to whom you are responding.

In any case, my solution works on the above data as well:

% gawk4 '{ while (match($0,/{{[^{}]+}}/)) { print substr($0,RSTART,RLENGTH);$0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) }}' <<X
? {{ .. 8 {{ .. 9 .. }} .. }} {{ .. 10 {{ .. 11 .. }} .. }} {{ .. 12 .. }}
? X
{{ .. 9 .. }}
{{ .. 8 .. }}
{{ .. 11 .. }}
{{ .. 10 .. }}
{{ .. 12 .. }}
%

--
People who say they'll vote for someone else because Obama couldn't fix
*all* of Bush's messes are like people complaining that he couldn't cure
cancer, so they'll go and vote for (more) cancer.

charlemagn...@gmail.com

Nov 7, 2016, 1:24:04 PM

On Monday, November 7, 2016 at 1:01:35 PM UTC-5, Kenny McCormack wrote:
> It would be a good thing if you would quote (i.e., learn how to quote) so
> that we could tell to whom you are responding.
>
> In any case, my solution works on the above data as well:

I'm not using the command line, so I translated it into a script and changed some things; that must be why it failed for me. Anyway, I am thankfully using your regex suggestion, but with patsplit, which seems to work and has the advantage of saving the other parts of the string between the records (the seps), which will be useful. This is not complete, only approximate.

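@include "readfile"
BEGIN {
    # Read the whole file and split it into lines.
    c = split(readfile("test.txt"), a, "\n")
    while (i++ < c) {
        # b[] receives the innermost records, sep[] the text between them.
        pc = patsplit(a[i], b, /{{[^{}]+}}/, sep)
        j = 0
        while (j++ < pc) {
            if (b[j]) print b[j]
            # Remove the record as a literal string; gsub() would treat it as a regex.
            if (p = index(a[i], b[j]))
                a[i] = substr(a[i], 1, p - 1) substr(a[i], p + length(b[j]))
        }
        # What is left holds the outer records and still needs another pass.
        if (a[i]) print a[i]
    }
}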

Joe User

Nov 7, 2016, 2:02:31 PM

charlemagn...@gmail.com wrote:

> I'm not using command line so I translated into a script and changed some
> things so that must be why. Anyway, I am thankfully using your regex suggestion
> but with patsplit which seems to work and has the advantage of saving
> other parts of the string between the records (the sep) which will be
> useful. This is not complete but approximate.

You don't specify the level of nesting to be expected. Can the nesting be
4-deep?

You don't specify what to do with unmatched {{ and }} strings.

But, for arbitrarily deep nesting, you should probably make a re-entrant
function that removes the deepest enclosed strings, and then calls itself
with the modified string.

Just a suggestion.
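One way that suggestion could look is sketched below. The names extract_records and peel are hypothetical, not from the thread, and the sketch keeps the earlier one-liner's limitation that a record may not contain a lone { or }. Unmatched {{ or }} text is simply left over and discarded, which is only one possible policy.

```sh
# extract_records is a hypothetical wrapper: print every {{ ... }} record,
# innermost first, for arbitrarily deep nesting.
extract_records() {
    awk '
    # peel() prints the leftmost innermost record, removes it from s,
    # and calls itself on what is left.
    function peel(s) {
        if (!match(s, /[{][{][^{}]+[}][}]/))
            return s    # no complete record left; unmatched text is returned as-is
        print substr(s, RSTART, RLENGTH)
        return peel(substr(s, 1, RSTART - 1) substr(s, RSTART + RLENGTH))
    }
    { peel($0) }
    ' "$@"
}
```

Called as `extract_records yourfile`, or with data on standard input. Because each call removes one record, the recursion depth equals the number of records on the line, so a plain while loop may be the better choice for very long lines.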

Kenny McCormack

Nov 7, 2016, 2:03:08 PM

In article <5ec5c86d-1712-40a4...@googlegroups.com>,

<charlemagn...@gmail.com> wrote:
>On Monday, November 7, 2016 at 1:01:35 PM UTC-5, Kenny McCormack wrote:
>> It would be a good thing if you would quote (i.e., learn how to quote) so
>> that we could tell to whom you are responding.
>>
>> In any case, my solution works on the above data as well:
>
>I'm not using command line so I translated into a script and changed some things
>so that must be why. Anyway, I am thankfully using your regex suggestion but with
>patsplit which seems to work and has the advantage of saving other parts of the
>string between the records (the sep) which will be useful. This is not complete
>but approximate.

I'm glad you were able to answer your own question. Ultimately, that's the
best result in any situation like this.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain in
compliance with said RFCs, the actual sig can be found at the following web address:
http://www.xmission.com/~gazelle/Sigs/Snicker

Ed Morton

Nov 7, 2016, 6:01:33 PM

You said "A solution shouldn't assume the data between {{}} is what is shown
here, it could be anything" - the solution you posted will fail when your
"anything" between pairs of curly brackets can contain an individual { or }, e.g.:

{{ x{y }}

This will work for any input:

$ awk '{
    gsub(/a/,"aA")
    gsub(/{/,"aB")
    gsub(/}/,"aC")
    gsub(/aBaB/,"{")
    gsub(/aCaC/,"}")
    while ( match($0,/(.*)({[^{}]*})(.*)/,a) ) {
        gsub(/[{}]/,"&&",a[2])
        gsub(/aA/,"a",a[2])
        gsub(/aB/,"{",a[2])
        gsub(/aC/,"}",a[2])
        print a[2]
        $0 = a[1] a[3]
    }
}' file
{{ .. 12 .. }}

It uses GNU awk for the 3rd arg to match(); with other awks, use substr() with RSTART and RLENGTH instead.

Ed.

charlemagn...@gmail.com

Nov 8, 2016, 10:09:45 AM

On Monday, November 7, 2016 at 6:01:33 PM UTC-5, Ed Morton wrote:
> It uses GNU awk for the 3rd arg to match, with other awks use substr() instead.
>
> Ed.

Hey thanks, Ed. That's actually a possibility (a single { or } in the data), so this is good. I've often used the third argument with the zero index but never with 1+. I looked it up to see how it works with a parenthesized regex (I recall seeing it before, but it didn't sink in why it might be used). It looks like a shortcut for the substr()/RSTART method, so this sort of parsing must be common enough. Good to know about.

Ed Morton

Nov 8, 2016, 10:55:49 AM

It's just a way of identifying capture groups and saving the matching strings as
array elements. Just like sed has 's/\(x\)\(y\)/\1-\2/' gawk in addition to
gensub() for THAT simple task has 'match($0,/(x)(y)/,a) { print a[1]"-"a[2] }'
which is much more powerful because you can do other operations on the captured
strings, e.g. 'match($0,/(x)(y)/,a) { print a[1]*7"-"int(a[2]) }'.

btw I just noticed I had a bug in my script, I wasn't unwinding the sub()s in
the right order so by converting "aA" back to "a" first I was creating the
possibility of "aB" being in the field as something other than the {-replacement
I originally created it to be. Basically you need to unwind the sub()s on each
field in the reverse order you did them on the record for the script to be robust.
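The effect of the order can be seen on a field that literally contains "aB". The sketch below is only an illustration, with decode_demo as a made-up name: it encodes one line the way the script does, then decodes it in both orders.

```sh
# decode_demo is a hypothetical helper used only for this illustration.
# Encoding, as in the script: "a" -> "aA", "{" -> "aB", "}" -> "aC".
decode_demo() {
    awk '{
        enc = $0
        gsub(/a/, "aA", enc); gsub(/[{]/, "aB", enc); gsub(/[}]/, "aC", enc)

        # Same order as the encoding: "aA" is undone first.
        bad = enc
        gsub(/aA/, "a", bad); gsub(/aB/, "{", bad); gsub(/aC/, "}", bad)

        # Reverse order: "aA" is undone last.
        good = enc
        gsub(/aC/, "}", good); gsub(/aB/, "{", good); gsub(/aA/, "a", good)

        print "encoded: " enc
        print "wrong:   " bad
        print "right:   " good
    }' "$@"
}

printf '%s\n' 'xaBy' | decode_demo
# encoded: xaABy
# wrong:   x{y
# right:   xaBy
```

Undoing "aA" first recreates the two characters "aB" from ordinary data, and the next step then mistakes them for an escaped brace.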

This is how it should be done:

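$ awk '{
    # Encode: "a" -> "aA", every brace -> "aB"/"aC", then each doubled brace -> a single "{" or "}".
    gsub(/a/,"aA")
    gsub(/{/,"aB")
    gsub(/}/,"aC")
    gsub(/aBaB/,"{")
    gsub(/aCaC/,"}")
    while ( match($0,/(.*)({[^{}]*})(.*)/,a) ) {
        # Decode the record in the reverse of the encoding order.
        gsub(/[{}]/,"&&",a[2])
        gsub(/aC/,"}",a[2])
        gsub(/aB/,"{",a[2])
        gsub(/aA/,"a",a[2])
        print a[2]
        $0 = a[1] a[3]
    }
}' file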
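For awks without the three-argument match(), the same idea can be written with RSTART, RLENGTH and substr(). This is a sketch under a few assumptions: the wrapper name extract_any and the variable rec are made up here, and the braces are written as bracket expressions to avoid any interval-expression parsing. It finds the leftmost innermost record rather than the last one, so the output order differs from the gawk version.

```sh
# extract_any is a hypothetical wrapper around a portable variant of the script.
extract_any() {
    awk '{
        # Encode: "a" -> "aA", every brace -> "aB"/"aC",
        # then each doubled brace -> a single "{" or "}".
        gsub(/a/, "aA")
        gsub(/[{]/, "aB")
        gsub(/[}]/, "aC")
        gsub(/aBaB/, "{")
        gsub(/aCaC/, "}")
        # Two-argument match() sets RSTART and RLENGTH.
        while (match($0, /[{][^{}]*[}]/)) {
            rec = substr($0, RSTART, RLENGTH)
            $0 = substr($0, 1, RSTART - 1) substr($0, RSTART + RLENGTH)
            # Decode the record in the reverse of the encoding order.
            gsub(/[{}]/, "&&", rec)
            gsub(/aC/, "}", rec)
            gsub(/aB/, "{", rec)
            gsub(/aA/, "a", rec)
            print rec
        }
    }' "$@"
}
```

Used as `extract_any file`; a record such as {{ x{y }} comes back unchanged because the lone { travels through the loop as "aB".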

Regards,

Ed.