multi-line + sub-regex
Newsgroups: comp.lang.awk
Date: Tue, 30 Oct 2012 09:07:56 -0700 (PDT)
In-Reply-To: <k6oofi$7va$1@news.xmission.com>
References: <59217ed3-0e6e-4920-b071-5972567b3104@googlegroups.com> <k6oofi$7va$1@news.xmission.com>
Message-ID: <76402c5e-3b4c-4cbf-9af1-1e3f06608c27@googlegroups.com>
Subject: Re: multi-line + sub-regex
From: mathieu.malate...@gmail.com
On Tuesday, October 30, 2012 3:33:22 PM UTC+1, Kenny McCormack wrote:
> In article <59217ed3-0e6e-4920-b071-5972567b3104@googlegroups.com>,
> <mathieu.malaterre> wrote:
> >Hi there,
> >
> > I am trying to use awk to parse a multiline expression. A single one
> >of them looks like this:
> >
> >_begin bla
> >_attrib0 123
> >_attrib1 456
> >_attrib1 789
> >_attrib2 foo
> >_end
> >...
> >
> >I need to extract the value associated to _begin and _attrib1. So in the
> >example, the awk script should return (one per line):
> >
> >bla 456 789
> >
> >Thanks for comments !
>
> Here's a way to do it that involves manipulating the "internal variables" -
> a technique that I usually argue against (i.e., think should be used with
> caution) - but I think you will find it interesting. Note to other group
> geeks: This is not necessarily the easiest (or least byte count) way to do
> this task, but, as I said, I find this approach interesting.
>
> # Yes, the output will have a trailing blank...
> BEGIN {ORS=" ";RS="_end\n";FS="\n| "}
> {for (i=1; i<=NF; i+=2)
> if ($i == "_begin" || $i == "_attrib1") print $(i+1)
> printf "\n"}
This seems to work, more or less, when the input uses spaces, but it fails when the input contains tab characters. My input file looks something like this:
_begin hello world !
_attrib0 123
_attrib1 super duper
_attrib1 yet another value
_attrib2 foo
_end
So in my case a space is not a separator; only a tab is. The values themselves can contain spaces.
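For what it's worth, here is a rough line-by-line sketch of the behaviour I am after. It does not use the RS/ORS trick quoted above; it simply sets FS to a tab and collects values until `_end`. It assumes that exactly one tab separates each tag from its value and that values never contain a tab. The helper name `extract_attrib1` is made up for illustration.

```shell
# extract_attrib1 is a hypothetical helper name, not something from the thread.
# It reads records on stdin and prints one line per _begin ... _end block.
extract_attrib1() {
    awk '
    BEGIN { FS = "\t" }                          # tag and value are split on a tab only
    $1 == "_begin"   { out = $2 }                # start a new output line
    $1 == "_attrib1" { out = out " " $2 }        # append every _attrib1 value
    $1 == "_end"     { print out; out = "" }     # block finished: emit and reset
    '
}

# Sample from this post, with a real tab (\t) after each tag.
printf '_begin\thello world !\n_attrib0\t123\n_attrib1\tsuper duper\n_attrib1\tyet another value\n_attrib2\tfoo\n_end\n' |
    extract_attrib1
```

On the sample this prints `hello world ! super duper yet another value`. Since the values contain spaces, joining them with a tab instead of a space might be a better choice for the output.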
Thanks