
*really* shortest match in awk - possible?


Tomasz Chmielewski

Nov 21, 2007, 1:18:33 PM
Yes, everything is possible in awk, but not for mere mortals, right?

I have an mbox file (see below for an example) with advertisements
attached to the end of every message. I would like to remove these ads.
Each ad sits between a line of -------- and a line of _________ characters.

I tried using something like:

awk '/^----/,/^____/{next}{print}'

but it eats up a bit too much, and one could argue whether it's really
the shortest match.

Here is an example mbox (note that the length of ------ and ______ can vary):


From - Sun Sep 18 12:55:25 2005
(...)
Some text I want to keep
Some text I want to keep

-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development.....
________________________________________
this text should stay
email@address
https://address/should/stay


From - Sun Sep 18 12:58:18 2005
(...)
2 Some text I want to keep 2
2 Some text I want to keep 2

A cool diagram which needs to stay:
-------------------------------------
|This will be gone, too if I use |
|awk '/^----/,/^____/{next}{print}' |
|because of "------" above |
-------------------------------------

2 Some text I want to keep 2
2 Some text I want to keep 2


--------------------
SF.Net email is sponsored by:
some other
advertisement
_______________________________________________
this text should stay
email@address
https://address/should/stay


From - Thu Sep 22 16:00:04 2005
(...)
3 Some text I want to keep 3
3 Some text I want to keep 3

-----------------------------------
SF.Net email is sponsored by:
_________________________________________
this text should stay
email@address
https://address/should/stay

--
Tomasz Chmielewski
http://wpkg.org
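
The over-matching can be reproduced on a much smaller input. An awk range pattern opens at the first line matching the start expression and closes at the first line matching the end expression, so an earlier dash line (the diagram border) opens the range long before the ad begins. The sketch below assumes a cut-down stand-in for the mbox; the `sample` function and its contents are made up for illustration.

```sh
# Cut-down stand-in for the mbox above (hypothetical data).
sample() {
    printf '%s\n' 'keep 1' '-----' '|diagram|' '-----' 'keep 2' \
                  '-----' 'ad text' '_____' 'keep 3'
}

# The range opens at the first /^----/ line (the diagram border) and only
# closes at the first /^____/ line, so the diagram and "keep 2" are lost too.
sample | awk '/^----/,/^____/{next}{print}'
```

Only `keep 1` and `keep 3` survive, which is the "eats up a bit too much" behaviour described in the post.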

loki harfagr

Nov 21, 2007, 2:30:16 PM
On Wed, 21 Nov 2007 19:18:33 +0100, Tomasz Chmielewski wrote:

> Yes, everything is possible in awk, but not for mere mortals, right?
>
> I have a mbox file (see below for an example) with advertisements
> attached to the end of every message. I would like to remove these ads.
> Ads are placed between -------- and _________ characters.
>
> I tried using something like:
>
> awk '/^----/,/^____/{next}{print}'
>
> but it eats up a bit too much, and one could argue if it's really the
> shortest match.

try those on your real data:

("should" be the shortest eq. to your expression)

$ awk '/^-+$/,/^_+$/{next}{print}'

or:

(still shorter *and* not oblivious ;-)

$ awk '/^-+$/,/^_*$/{next}{print}'

If there is any problem, please provide more data and a description of
the trouble :-)

Janis Papanagnou

Nov 21, 2007, 2:48:28 PM
Tomasz Chmielewski wrote:
> Yes, everything is possible in awk, but not for mere mortals, right?
>
> I have a mbox file (see below for an example) with advertisements
> attached to the end of every message. I would like to remove these ads.
> Ads are placed between -------- and _________ characters.
>
> I tried using something like:
>
> awk '/^----/,/^____/{next}{print}'
>
> but it eats up a bit too much, and one could argue if it's really the
> shortest match.

So your requirement seems to be: delete from the last ^[-]+$ line that
precedes a ^[_]+$ line up to that ^[_]+$ line, ignoring the effect of any
earlier ^[-]+$ lines, like...?

----
visible ads
----
removed ads
____


I see two more or less apparent solutions. You can print the data until
--- and delay further printing until another ---, if there's a ___ then
put subsequent lines in another store, ...etc. You see, quite dubious.
Or you can use external tools [off-topic for awk]; reverse the file using
the Unix command 'tac', then do the block filtering as above with pattern
ranges swapped, then do another 'tac'...

tac | awk '/^_+$/,/^-+$/{next}1' | tac

(Which seems a much simpler solution to the task than the other way.)

Janis
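
A small run of the tac pipeline may make the idea easier to follow. This is only a sketch: it assumes GNU `tac` is installed, and the `sample` and `strip_ads_tac` names, as well as the sample data, are invented for illustration.

```sh
# Cut-down stand-in for the mbox in the thread (hypothetical data).
sample() {
    printf '%s\n' 'keep 1' '-----' '|diagram|' '-----' 'keep 2' \
                  '-----' 'ad text' '_____' 'keep 3'
}

# Reverse the input, drop each block from an underscore line through the
# first dash line that follows it (the nearest one above it in the
# original order), then restore the order.
strip_ads_tac() {
    tac | awk '/^_+$/,/^-+$/{next}1' | tac
}

sample | strip_ads_tac
```

On this input the diagram and both `keep` lines around it are preserved, and only the three lines from the last dash line through the underscore line are removed. Note that a stray underscore line with no ad above it would still open a range in the reversed stream, so real data may need a stricter end pattern.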

Janis Papanagnou

Nov 21, 2007, 2:55:23 PM

BTW, even if your examples seem not to require that, it is advisable
to add logic to continue printing after the first erased block.

>
> (Which seems much simpler a solution of the task than the other way.)
>
> Janis
>

>>[examples snipped]

Janis Papanagnou

Nov 21, 2007, 3:07:13 PM
Tomasz Chmielewski wrote:
> Yes, everything is possible in awk, but not for mere mortals, right?
>
> I have a mbox file (see below for an example) with advertisements
> attached to the end of every message. I would like to remove these ads.
> Ads are placed between -------- and _________ characters.
>
> I tried using something like:
>
> awk '/^----/,/^____/{next}{print}'
>
> but it eats up a bit too much, and one could argue if it's really the
> shortest match.

Okay, then. Here *is* an on-topic awk solution.

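{ a[NR] = $0 }            # buffer every line
/^-+$/ { from = NR }      # most recent all-dash line
/^_+$/ { to = NR }        # most recent all-underscore line
END {
    if (from < to) {
        for (i = 1; i < from; i++) print a[i]
        for (i = to + 1; i <= NR; i++) print a[i]   # <= so the final line is kept
    }
}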

(If we can use tac to store the data we can as well let awk do that.)

Janis
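
The script above remembers a single from/to pair, so it suits one message at a time. The same buffering idea might be extended to a whole mbox by walking the stored lines backwards in END, much as tac would. The following is a sketch under that assumption, not the poster's code; `strip_ads`, `sample`, `line` and `drop` are illustrative names, and the sample data is made up.

```sh
# Two-message stand-in for an mbox (hypothetical data).
sample() {
    printf '%s\n' 'msg 1' '-----' '|diagram|' '-----' 'more 1' \
                  '---' 'ad one' '___' 'footer 1' \
                  'msg 2' '-----' 'ad two' 'ad -- with dashes' '_____' 'footer 2'
}

strip_ads() {
    awk '
    { line[NR] = $0 }                      # buffer the whole input
    END {
        skipping = 0
        for (i = NR; i >= 1; i--) {        # walk backwards, like tac
            if (!skipping && line[i] ~ /^_+$/)
                skipping = 1               # underscore line: an ad ends here
            if (skipping) {
                drop[i] = 1
                if (line[i] ~ /^-+$/)
                    skipping = 0           # nearest dash line above: ad starts here
            }
        }
        for (i = 1; i <= NR; i++)          # print survivors in original order
            if (!(i in drop)) print line[i]
    }' "$@"
}

sample | strip_ads
```

Because the whole file is held in memory, this trades space for simplicity; it needs no external tools.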

Radoulov, Dimitre

Nov 21, 2007, 3:23:12 PM


With GNU Awk you could try something like this:


awk 1 RS="-+\n[^-]**_+\n" filename

Dimitre

Radoulov, Dimitre

Nov 21, 2007, 3:30:46 PM

Or just:

awk 1 RS="-+[^-]**_+" filename


Dimitre

Radoulov, Dimitre

Nov 21, 2007, 3:34:04 PM
Radoulov, Dimitre wrote:
[...]

>> With GNU Awk you could try something like this:
>>
>>
>> awk 1 RS="-+\n[^-]**_+\n" filename
>
> Or just:
>
> awk 1 RS="-+[^-]**_+" filename

Sorry ...

awk 1 RS="-+[^-]*_+" filename


Dimitre

Radoulov, Dimitre

Nov 21, 2007, 3:58:15 PM

... and of course this will fail if
the advertisement contains the "-" character,
so please ignore the post.


Dimitre


Tomasz Chmielewski

Nov 21, 2007, 4:15:52 PM
Tomasz Chmielewski schrieb:

> Yes, everything is possible in awk, but not for mere mortals, right?
>
> I have a mbox file (see below for an example) with advertisements
> attached to the end of every message. I would like to remove these ads.
> Ads are placed between -------- and _________ characters.
>
> I tried using something like:
>
> awk '/^----/,/^____/{next}{print}'
>
> but it eats up a bit too much, and one could argue if it's really the
> shortest match.

Thanks a lot for all answers in this thread! ;)

Steffen Schuler

Nov 22, 2007, 6:50:19 PM
Hi Tomasz, hello netlanders,

On Wed, 21 Nov 2007 19:18:33 +0100, Tomasz Chmielewski wrote:

Here is a working POSIX awk script, coded according to Janis's ideas:

/^_+$/ { pr(a, e-1)
         del(a)
         e = 0
         next }
/^-+$/ { ++e; j = 1 }
e      { a[e, j++] = $0; next }
1
END    { pr(a, e) }

function pr(a, e,   i, j) {
    for (i = 1; i <= e; ++i) {
        for (j = 1; (i,j) in a; ++j)
            print a[i,j]
    } }

function del(a,   k) {
    for (k in a)
        delete a[k]
}

Janis's tac-awk solution is more elegant, but not as efficient as the
above script. Janis's on-topic awk script doesn't work for multiple ads.

Hope I could help,

Steffen "goedel" Schuler

Janis Papanagnou

Nov 22, 2007, 7:11:32 PM
Steffen Schuler wrote:
>
> Janis's on-topic awk script doesn't work for multiple ads.

Indeed, I misread the OP: I assumed single mails were to be processed
and missed that a whole mbox should be handled. :-/

Sorry.

Janis

Steffen Schuler

Nov 22, 2007, 11:02:01 PM
Hello netlanders,

On Thu, 22 Nov 2007 23:50:19 +0000, Steffen Schuler wrote:

<snip>

<snip>

A shorter, simpler POSIX awk script that also saves space and time:

/^-+$/      { pr(); del(); addl(); next }
/^_+$/ && i { del(); next }
!i          { print; next }
            { addl() }
END         { pr() }

function addl() {
    a[++i] = $0
}

function pr() {
    for (i = 1; i in a; ++i)
        print a[i]
}

function del() {
    for (i in a)
        delete a[i]
    i = 0
}

Enjoy awk,

Steffen "goedel" Schuler

William James

Nov 23, 2007, 11:03:15 PM
On Nov 21, 12:18 pm, Tomasz Chmielewski <t...@nospam.syneticon.net>
wrote:

> Yes, everything is possible in awk, but not for mere mortals, right?
>
> I have a mbox file (see below for an example) with advertisements
> attached to the end of every message. I would like to remove these ads.
> Ads are placed between -------- and _________ characters.
>
> I tried using something like:
>
> awk '/^----/,/^____/{next}{print}'
>
> but it eats up a bit too much, and one could argue if it's really the
> shortest match.
>
> Here an example mbox (note that the length of ------ and ______ can vary):
>
> From - Sun Sep 18 12:55:25 2005
> (...)
> Some text I want to keep
> Some text I want to keep
>
> -------------------------------------------------------
> SF.Net email is sponsored by:
> Tame your development.....
> ________________________________________
> this text should stay
> email@address
> https://address/should/stay

>
> From - Sun Sep 18 12:58:18 2005
> (...)
> 2 Some text I want to keep 2
> 2 Some text I want to keep 2
>
> A cool diagram which needs to stay:
> -------------------------------------
> |This will be gone, too if I use |
> |awk '/^----/,/^____/{next}{print}' |
> |because of "------" above |
> -------------------------------------
>
> 2 Some text I want to keep 2
> 2 Some text I want to keep 2
>
> --------------------
> SF.Net email is sponsored by:
> some other
> advertisement
> _______________________________________________
> this text should stay
> email@address
> https://address/should/stay

>
> From - Thu Sep 22 16:00:04 2005
> (...)
> 3 Some text I want to keep 3
> 3 Some text I want to keep 3
>
> -----------------------------------
> SF.Net email is sponsored by:
> _________________________________________
> this text should stay
> email@address
> https://address/should/stay


Gawk:

BEGIN { RS = "\n---+\n(([^_-][^\n]*)?\n)*[-_]+\n" }

{ print
  if ( RT ~ /-\n$/ )
      print RT
}

---- output ----


From - Sun Sep 18 12:55:25 2005
(...)
Some text I want to keep
Some text I want to keep

this text should stay
email@address
https://address/should/stay

From - Sun Sep 18 12:58:18 2005
(...)
2 Some text I want to keep 2
2 Some text I want to keep 2

A cool diagram which needs to stay:

-------------------------------------
|This will be gone, too if I use |
|awk '/^----/,/^____/{next}{print}' |
|because of "------" above |
-------------------------------------


2 Some text I want to keep 2
2 Some text I want to keep 2

this text should stay
email@address
https://address/should/stay

From - Thu Sep 22 16:00:04 2005
(...)
3 Some text I want to keep 3
3 Some text I want to keep 3

this text should stay
email@address
https://address/should/stay
