I have an mbox file (see below for an example) with advertisements
attached to the end of every message. I would like to remove these ads.
Each ad sits between a line of dashes (--------) and a line of underscores (_________).
I tried using something like:
awk '/^----/,/^____/{next}{print}'
but it removes too much, and one could argue whether it's really the
shortest match.
Here is an example mbox (note that the length of the ------ and ______ lines can vary):
From - Sun Sep 18 12:55:25 2005
(...)
Some text I want to keep
Some text I want to keep
-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development.....
________________________________________
this text should stay
email@address
https://address/should/stay
From - Sun Sep 18 12:58:18 2005
(...)
2 Some text I want to keep 2
2 Some text I want to keep 2
A cool diagram which needs to stay:
-------------------------------------
|This will be gone, too if I use |
|awk '/^----/,/^____/{next}{print}' |
|because of "------" above |
-------------------------------------
2 Some text I want to keep 2
2 Some text I want to keep 2
--------------------
SF.Net email is sponsored by:
some other
advertisement
_______________________________________________
this text should stay
email@address
https://address/should/stay
From - Thu Sep 22 16:00:04 2005
(...)
3 Some text I want to keep 3
3 Some text I want to keep 3
-----------------------------------
SF.Net email is sponsored by:
_________________________________________
this text should stay
email@address
https://address/should/stay
--
Tomasz Chmielewski
http://wpkg.org
> Yes, everything is possible in awk, but not for mere mortals, right?
>
> I have a mbox file (see below for an example) with advertisements
> attached to the end of every message. I would like to remove these ads.
> Ads are placed between -------- and _________ characters.
>
> I tried using something like:
>
> awk '/^----/,/^____/{next}{print}'
>
> but it eats up a bit too much, and one could argue if it's really the
> shortest match.
Try these on your real data:
(this "should" be the shortest equivalent of your expression)
$ awk '/^-+$/,/^_+$/{next}{print}'
or:
(still shorter *and* not oblivious ;-)
$ awk '/^-+$/,/^_*$/{next}{print}'
If there is any problem, please provide more data and a description of
the trouble :-)
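To make the difference between the loose and the anchored range visible, here is a minimal sketch. The sample lines are made up for illustration and are not taken from the real mbox.

```shell
# Hypothetical sample: one line merely starts with dashes, one is a real separator.
sample='keep 1
---- not a separator
--------
ad text
________
keep 2'

# Original ranges: any line starting with ---- opens the range.
loose=$(printf '%s\n' "$sample" | awk '/^----/,/^____/{next}{print}')

# Anchored ranges: only lines made up entirely of - (or _) count.
strict=$(printf '%s\n' "$sample" | awk '/^-+$/,/^_+$/{next}{print}')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The anchored form keeps the "---- not a separator" line, while the loose form drops it together with the ad. Neither variant protects a dash-framed diagram that precedes the ad, because the range still opens at the first all-dash line.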
So your requirement seems to be: delete from the last ^[-]+$ line up to the
last ^[_]+$ line, ignoring the effect of any earlier such dash lines,
like...?
----
visible ads
----
removed ads
____
I see two more or less obvious solutions. You can print the data until a
--- line and hold back further output until the next --- line; if a ___ line
turns up, put subsequent lines in another store, and so on. As you see, quite dubious.
Or you can use external tools [off-topic for awk]: reverse the file with
the Unix command 'tac', do the block filtering as above with the range
patterns swapped, then run 'tac' again...
tac | awk '/^_+$/,/^-+$/{next}1' | tac
(Which seems a much simpler solution to the task than the other way.)
Janis
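A small sketch of the reversed-file idea, assuming GNU coreutils 'tac' is installed (it is not part of POSIX) and using a made-up sample with a diagram in front of the ad:

```shell
# Hypothetical sample: a dash-framed diagram, then one ad block.
sample='text
-----
|diagram|
-----
more text
----
ad
____
footer'

# Reverse the lines, drop everything from the ___ line to the first --- line
# that follows it (the last --- line before the ad in the original order),
# then restore the order.
out=$(printf '%s\n' "$sample" | tac | awk '/^_+$/,/^-+$/{next}1' | tac)

printf '%s\n' "$out"
```

Because the range now opens at the underscore line, the dash lines around the diagram never start a range and are passed through untouched.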
BTW, even if your examples seem not to require that, it is advisable
to add logic to continue printing after the first erased block.
>
> (Which seems much simpler a solution of the task than the other way.)
>
> Janis
>
>>[examples snipped]
Okay, then. Here *is* an on-topic awk solution.
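{ a[NR] = $0 }            # buffer every line
/^-+$/ { from = NR }      # remember the last all-dash line
/^_+$/ { to = NR }        # remember the last all-underscore line
END {
    if (from < to) {
        for (i = 1; i < from; i++) print a[i]
        for (i = to + 1; i <= NR; i++) print a[i]   # <= so the final line is kept
    } else {
        for (i = 1; i <= NR; i++) print a[i]        # no ad found: print everything
    }
}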
(If we can use tac to store the data we can as well let awk do that.)
Janis
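For a single message, the buffering idea can be tried out as below. This is only a sketch: the sample is invented, the second loop runs to i <= NR so that the final line is kept, and the else branch (print everything when no ad is found) is an assumption added for the demonstration.

```shell
# Hypothetical single message: a diagram, then one ad, then a footer.
sample='keep 1
-----
|diagram|
-----
keep 2
---
ad
___
footer'

out=$(printf '%s\n' "$sample" | awk '
  { a[NR] = $0 }           # buffer every line
  /^-+$/ { from = NR }     # last all-dash line seen so far
  /^_+$/ { to = NR }       # last all-underscore line seen so far
  END {
    if (from < to) {
      for (i = 1; i < from; i++) print a[i]
      for (i = to + 1; i <= NR; i++) print a[i]
    } else {
      for (i = 1; i <= NR; i++) print a[i]
    }
  }')

printf '%s\n' "$out"
```

Since only the last dash line and the last underscore line are remembered, this handles exactly one ad per input.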
With GNU Awk you could try something like this:
awk 1 RS="-+\n[^-]**_+\n" filename
Dimitre
Or just:
awk 1 RS="-+[^-]**_+" filename
Dimitre
Sorry ...
awk 1 RS="-+[^-]*_+" filename
Dimitre
... and of course this will fail if
the advertisement contains the "-" character,
so please ignore the post.
Dimitre
Thanks a lot for all answers in this thread! ;)
On Wed, 21 Nov 2007 19:18:33 +0100, Tomasz Chmielewski wrote:
Here is a working POSIX awk script, coded according to Janis's ideas:
/^_+$/ { pr(a, e-1)
         del(a)
         e = 0
         next }
/^-+$/ { ++e; j = 1 }
e      { a[e, j++] = $0; next }
1
END    { pr(a, e) }
function pr(a, e,    i, j) {
    for (i = 1; i <= e; ++i) {
        for (j = 1; (i,j) in a; ++j)
            print a[i,j]
    }
}
function del(a,    k) {
    for (k in a)
        delete a[k]
}
Janis's tac-awk solution is more elegant, but not as efficient as the
above script. Janis's on-topic awk script doesn't work for
multiple ads.
Hope I could help,
Steffen "goedel" Schuler
Indeed, I misread the OP: I assumed single mails were to be processed
and missed that a whole mbox should be handled. :-/
Sorry.
Janis
On Thu, 22 Nov 2007 23:50:19 +0000, Steffen Schuler wrote:
<snip>
A shorter, simpler, and more space- and time-efficient POSIX awk script is:
/^-+$/      { pr(); del(); addl(); next }
/^_+$/ && i { del(); next }
!i          { print; next }
            { addl() }
END         { pr() }
function addl() {
    a[++i] = $0
}
function pr() {
    for (i = 1; i in a; ++i)
        print a[i]
}
function del() {
    for (i in a)
        delete a[i]
    i = 0
}
Enjoy awk,
Steffen "goedel" Schuler
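As a quick check, the script can be run on a cut-down, made-up two-message sample. The wrapper name strip_ads and the sample text are assumptions for this sketch; the comments describe what each rule does.

```shell
# Hypothetical two-message sample: two ads and one dash-framed diagram.
sample='From - msg 1
keep 1
-----
ad one
_____
footer 1
From - msg 2
---------
|diagram|
---------
keep 2
---
ad two
___
footer 2'

strip_ads() {
  awk '
    /^-+$/      { pr(); del(); addl(); next }  # dash line: flush the old buffer, start a new one
    /^_+$/ && i { del(); next }                # underscore line closes an ad: drop the buffer
    !i          { print; next }                # not buffering: pass the line through
                { addl() }                     # buffering: hold the line back
    END         { pr() }                       # a trailing buffer was not an ad
    function addl() { a[++i] = $0 }
    function pr()   { for (i = 1; i in a; ++i) print a[i] }
    function del()  { for (i in a) delete a[i]; i = 0 }
  '
}

out=$(printf '%s\n' "$sample" | strip_ads)
printf '%s\n' "$out"
```

Only one candidate block is held in memory at a time: every new dash line flushes the previous buffer, so the diagram is printed and only the block that is closed by an underscore line is discarded.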
Gawk:
BEGIN { RS = "\n---+\n(([^_-][^\n]*)?\n)*[-_]+\n" }
{ print
if ( RT ~ /-\n$/ )
print RT
}
---- output ----
From - Sun Sep 18 12:55:25 2005
(...)
Some text I want to keep
Some text I want to keep
this text should stay
email@address
https://address/should/stay
From - Sun Sep 18 12:58:18 2005
(...)
2 Some text I want to keep 2
2 Some text I want to keep 2
A cool diagram which needs to stay:
-------------------------------------
|This will be gone, too if I use |
|awk '/^----/,/^____/{next}{print}' |
|because of "------" above |
-------------------------------------
2 Some text I want to keep 2
2 Some text I want to keep 2
this text should stay
email@address
https://address/should/stay
From - Thu Sep 22 16:00:04 2005
(...)
3 Some text I want to keep 3
3 Some text I want to keep 3
this text should stay
email@address
https://address/should/stay