
Is awk suitable for this?


no.to...@gmail.com

Sep 10, 2012, 12:46:38 PM
We need these heuristics to trim out the repetitive garbage
that even the minimalist lynx/links fetches from http-pages.

How would `awk` or other method do:-
f1 : FileOfLinesToDelete
from
f2 : DataFileToBeCleanedUpByDeletingLines.

And the next refinement, is that f1A contains RegexS on each line,
rather than <perfect matches>.

Pseudo-code might be:-

FOREACH LineOf(f1) DO
   FOREACH LineOf(f2) DO
      IF Matches( LineOf(f1), LineOf(f2))
      THEN delete(LineOf(f2))
   Endf2
Endf1

What about later optimisation/refinement by processing f1, so
that the ORDER of to-be-deleted-lines, relates to their contents?

== TIA.

Ed Morton

Sep 10, 2012, 12:59:06 PM
no.to...@gmail.com <no.to...@gmail.com> wrote:

> We need these heuristics to trim out the repetative garbage
> that even the minimalist lynx/links fetches from http-pages.
>
> How would `awk` or other method do:-
> f1 : FileOfLinesToDelete
> from
> f2 : DataFileToBeCleanedUpByDeletingLines.

Assuming f1 contains text that appears as lines in f2 rather than line
numbers from f2:

awk 'NR==FNR{del[$0]; next} !($0 in del)' f1 f2
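
Spelled out with comments, the one-liner reads roughly as follows. This is only an illustrative sketch: the delete_exact wrapper and the sample contents of f1 and f2 are made up here, not taken from the OP's data.

```sh
# Hypothetical sample data, not from the thread.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'Sign in' 'Report this message' > f1
printf '%s\n' 'Sign in' 'Chapter 1' 'Report this message' 'Please Sign in here' > f2

delete_exact() {   # usage: delete_exact LINES_TO_DELETE DATA_FILE
    awk '
        # NR counts records across all files, FNR restarts for each file,
        # so NR == FNR holds only while the first file is being read.
        NR == FNR { del[$0]; next }   # remember the whole line as an array key
        # Second file: print the line unless it is one of the remembered keys.
        !($0 in del)
    ' "$1" "$2"
}

delete_exact f1 f2
# prints:
# Chapter 1
# Please Sign in here
```

"Please Sign in here" survives because the comparison is on whole lines. One caveat: if f1 is empty, NR == FNR is also true while f2 is being read, so nothing is printed.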

> And the next refinement, is that f1A contains RegexS on each line,
> rather than <perfect matches>.

awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2

>
> Pseudo-code might be:-
>
> FOREACH LineOf(f1) DO
> FOREACH LineOf(f2) DO
> IF Matches( LineOf(f1), LineOf(f2))
> THEN delete(LineOf(f2))
> Endf2
> Endf1
>
> What about later optimisation/refinement by processing f1, so
> that the ORDER of to-be-deleted-lines, relates to their contents?

Probably trivial in awk as it's designed for text processing, certainly no
easier in any other tool/language. Post an example (input plus expected
output) if you'd like to see a solution.

Ed.
>
> == TIA.


Posted using www.webuse.net

Kaz Kylheku

Sep 10, 2012, 3:20:19 PM
["Followup-To:" header set to comp.lang.awk.]
On 2012-09-10, no.to...@gmail.com <no.to...@gmail.com> wrote:
> We need these heuristics to trim out the repetative garbage
> that even the minimalist lynx/links fetches from http-pages.
>
> How would `awk` or other method do:-
> f1 : FileOfLinesToDelete
> from
> f2 : DataFileToBeCleanedUpByDeletingLines.
>
> And the next refinement, is that f1A contains RegexS on each line,
> rather than <perfect matches>.

Note that f1a can be mechanically translated to a bunch of sed commands
to delete lines.

/REGEX1/d
/REGEX2/d
/REGEX3/d

> FOREACH LineOf(f1) DO
> FOREACH LineOf(f2) DO
> IF Matches( LineOf(f1), LineOf(f2))
> THEN delete(LineOf(f2))
> Endf2
> Endf1

or: "sed -f scriptfile infile > outfile"

Dave Gibson

Sep 10, 2012, 4:02:23 PM
[ Followup-To: set to comp.os.linux.misc ]

In comp.os.linux.misc, no.to...@gmail.com wrote:
> We need these heuristics to trim out the repetative garbage
> that even the minimalist lynx/links fetches from http-pages.
>
> How would `awk` or other method do:-
> f1 : FileOfLinesToDelete
> from
> f2 : DataFileToBeCleanedUpByDeletingLines.
>
> And the next refinement, is that f1A contains RegexS on each line,
> rather than <perfect matches>.
>
> Pseudo-code might be:-
>
> FOREACH LineOf(f1) DO
> FOREACH LineOf(f2) DO
> IF Matches( LineOf(f1), LineOf(f2))
> THEN delete(LineOf(f2))
> Endf2
> Endf1

grep -v -f f1 -- f2 > f3

Ed Morton

Sep 10, 2012, 4:29:33 PM
Kaz Kylheku <k...@kylheku.com> wrote:

> ["Followup-To:" header set to comp.lang.awk.]
> On 2012-09-10, no.to...@gmail.com <no.to...@gmail.com> wrote:
> > We need these heuristics to trim out the repetative garbage
> > that even the minimalist lynx/links fetches from http-pages.
> >
> > How would `awk` or other method do:-
> > f1 : FileOfLinesToDelete
> > from
> > f2 : DataFileToBeCleanedUpByDeletingLines.
> >
> > And the next refinement, is that f1A contains RegexS on each line,
> > rather than <perfect matches>.
>
> Note that f1a can be mechanically translated to a bunch of sed commands
> to delete lines.
>
> /REGEX1/d
> /REGEX2/d
> /REGEX3/d

Just be careful how you handle any "/"s or other "special" characters in
your input file during that translation process if you decide to go that
route. Not sure what other characters (or situations) would be "special"
to sed, but I expect, for example, that a "\" at the end of an input line
would mess you up since you could end up with "/REGEX\/d" after
translation if you don't handle it and there may be more...

Ed.

>
> > FOREACH LineOf(f1) DO
> > FOREACH LineOf(f2) DO
> > IF Matches( LineOf(f1), LineOf(f2))
> > THEN delete(LineOf(f2))
> > Endf2
> > Endf1
>
> or: "sed -f scriptfile infile > outfile"
>



Ed Morton

Sep 10, 2012, 4:43:36 PM
Dave Gibson <dave.gma...@googlemail.com.invalid> wrote:

> [ Followup-To: set to comp.os.linux.misc ]
>
> In comp.os.linux.misc, no.to...@gmail.com wrote:
> > We need these heuristics to trim out the repetative garbage
> > that even the minimalist lynx/links fetches from http-pages.
> >
> > How would `awk` or other method do:-
> > f1 : FileOfLinesToDelete
> > from
> > f2 : DataFileToBeCleanedUpByDeletingLines.
> >
> > And the next refinement, is that f1A contains RegexS on each line,
> > rather than <perfect matches>.
> >
> > Pseudo-code might be:-
> >
> > FOREACH LineOf(f1) DO
> > FOREACH LineOf(f2) DO
> > IF Matches( LineOf(f1), LineOf(f2))
> > THEN delete(LineOf(f2))
> > Endf2
> > Endf1
>
> grep -v -f f1 -- f2 > f3

Just be careful that you want to delete from f2 lines that CONTAIN the
lines listed in f1 rather than the f2 lines that completely match the
lines listed in f1. With the above if f1 contains the text "the" then
every line in f2 that contains the text "the" (e.g. "their", "father",
etc.) will be deleted rather than just lines that are exactly just the
word "the". Of course, you can fix that by adding a "^" at the start and
"$" at the end of every line in f1. When you get to your enhancement of
looking for Regexs you'll need to consider that in any tool.
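
If exact whole-line matching of plain strings is what is wanted, grep can do that without editing f1: -F treats each pattern as a fixed string and -x requires it to match the whole line. A small sketch with made-up sample data:

```sh
# Hypothetical sample data, not from the thread.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'the' > f1
printf '%s\n' 'the' 'their father' 'cat' > f2

# Patterns are regexes matched anywhere in the line:
grep -v -f f1 -- f2 > substring.out
# Patterns are fixed strings that must equal the whole line:
grep -v -x -F -f f1 -- f2 > exact.out

cat substring.out
# cat
cat exact.out
# their father
# cat
```

Once f1 holds real regexes, -F no longer applies, and the "^"/"$" anchoring described above (or -x on its own) is what decides between "contains" and "is exactly".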

Like with the sed solution though, if you go the grep route for your first
step or 2 above, then you will almost certainly want to switch to awk for
your next step below (or any other enhancement) anyway.

> > What about later optimisation/refinement by processing f1, so
> > that the ORDER of to-be-deleted-lines, relates to their contents?

Regards,

Ed.


no.to...@gmail.com

Sep 10, 2012, 5:04:50 PM
In article <201209101...@kylheku.com>, Kaz Kylheku <k...@kylheku.com> wrote:

> ["Followup-To:" header set to comp.lang.awk.]
> On 2012-09-10, no.to...@gmail.com <no.to...@gmail.com> wrote:
> > We need these heuristics to trim out the repetative garbage
> > that even the minimalist lynx/links fetches from http-pages.
> >
> > How would `awk` or other method do:-
> > f1 : FileOfLinesToDelete
> > from
> > f2 : DataFileToBeCleanedUpByDeletingLines.
> >
> > And the next refinement, is that f1A contains RegexS on each line,
> > rather than <perfect matches>.
>
> Note that f1a can be mechanically translated to a bunch of sed commands
> to delete lines.
Yes manually.
But mechanically requires a <compiler construction first>.
>
> /REGEX1/d
> /REGEX2/d
> /REGEX3/d
>
> > FOREACH LineOf(f1) DO
> > FOREACH LineOf(f2) DO
> > IF Matches( LineOf(f1), LineOf(f2))
> > THEN delete(LineOf(f2))
> > Endf2
> > Endf1
>
> or: "sed -f scriptfile infile > outfile"
> .
OK, previously I had a serial set of sed , including:
<delete allLines Inclusively From Regex1
upto Regex2>

Which was OK for a while, until it started deleting good
sections. So then I wanted to add: `but only provided
LineOfRegex2 - LineOfRegex1 < N `.
But I don't know how/if to do that.
And since awk is so versatile, it could certainly do it.
Besides, more importantly, this new proposed method,
being data-driven is more flexible.
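
The "only if the block is shorter than N lines" rule can be sketched in awk by holding lines back once Regex1 is seen and deciding what to do with them when Regex2 (or the limit) arrives. The function name, the marker strings and the limit of 3 below are placeholders, not anything from the real pages:

```sh
delete_short_blocks() {   # usage: delete_short_blocks START_RE STOP_RE MAX FILE
    # Note: awk processes backslash escapes in -v values.
    awk -v start="$1" -v stop="$2" -v max="$3" '
        function emit_held(    i) {      # print the held lines, stop holding
            for (i = 1; i <= n; i++) print buf[i]
            n = 0; holding = 0
        }
        !holding && $0 ~ start { holding = 1 }   # a candidate block begins
        !holding { print; next }                 # ordinary line: pass through
        {
            buf[++n] = $0                        # hold the line back
            if (n > 1 && $0 ~ stop) { n = 0; holding = 0 }  # short block: drop
            else if (n >= max) emit_held()       # too long: keep it after all
        }
        END { emit_held() }                      # unterminated block is kept
    ' "$4"
}

# Hypothetical sample input.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > page.txt <<'EOF'
intro
BEGIN-AD
ad text
END-AD
chapter
BEGIN-AD
long
long
long
END-AD
tail
EOF

delete_short_blocks '^BEGIN-AD$' '^END-AD$' 3 page.txt
# prints:
# intro
# chapter
# BEGIN-AD
# long
# long
# long
# END-AD
# tail
```

The first block is 3 lines long and is dropped; the second one exceeds the limit before its end marker shows up, so it is passed through untouched.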

My first efforts which were for cleaning google-fetched
Usenet, were lost, after google changed the format.

It's just occurred to me that awk could do:
<delete all-except-the-1st-one of
consecutive sequences of lines
which are repeated>

So when you fetch a sequence of articles from a http-site,
via lynx/links and append them to a file, to make a book
of N chapters [where, here each char represents a line],
when your fetch&Append file gives you:
abc123|abc456|abc789|abc....
awk should filter this to:
abc123|456|789|....
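
A minimal sketch of that filter, assuming "repeated" simply means "a line identical to one already seen"; the function name and the sample file are invented for illustration:

```sh
dedupe_lines() {    # usage: dedupe_lines FILE
    # seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
    # is true exactly once per distinct line, and awk prints the line then.
    awk '!seen[$0]++' "$1"
}

# One character per line, as in the example above.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' a b c 1 2 3 a b c 4 5 6 a b c 7 8 9 > book.txt

dedupe_lines book.txt | tr -d '\n'; echo
# abc123456789
```

Note that this drops every later repeat anywhere in the file, not only repeated blocks, so legitimately repeated lines (blank lines, for instance) go too.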

It's staggering what they do with awk.
But it's a dog to learn.



no.to...@gmail.com

Sep 10, 2012, 5:05:01 PM
Ed Morton's post was on google but not yet on eternal-september
so, I'll patch in the goog version , without the <thread link>

> > How would `awk` or other method do:-
> > f1 : FileOfLinesToDelete
> > from
> > f2 : DataFileToBeCleanedUpByDeletingLines.

> Assuming f1 contains text that appears as lines in f2 rather than line
> numbers from f2:
yes
> awk 'NR==FNR{del[$0]; next} !($0 in del)' f1 f2
>
That's good! Except *I* can't decode it.

> > And the next refinement, is that f1A contains RegexS on each line,
> > rather than <perfect matches>.
>
> awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2

Nor can I decode this. Which didn't delete lines like:-
[49]Print | [50]Individual message | [51]Show original | [52]Report
although it DID delete lines like:-
this message | [53]Find messages by this author

So an example of a Regex would be: for matching,
ignore white-chars and <digits between square brackets>
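
A probable explanation, assuming f1 held those lines verbatim: the second one-liner uses each f1 line as a regular expression, and in an RE "[49]" is a bracket expression matching a single "4" or "9", while "|" separates alternatives. So the first line never matches its own text (nothing in it has a 4 or 9 directly before "Print", and so on for the other alternatives), whereas the second line's first alternative, "this message ", is plain text and matches. The sketch below reproduces that with just those two lines in both files:

```sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > f1 <<'EOF'
[49]Print | [50]Individual message | [51]Show original | [52]Report
this message | [53]Find messages by this author
EOF
cp f1 f2    # every f2 line is literally present in f1

# f1 lines used as regexes: the bracketed line is NOT deleted.
awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2 > regex.out
# f1 lines used as exact strings: both lines are deleted.
awk 'NR==FNR{del[$0]; next} !($0 in del)' f1 f2 > exact.out

cat regex.out
# [49]Print | [50]Individual message | [51]Show original | [52]Report
wc -l < exact.out    # 0 lines left
```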

> > What about later optimisation/refinement by processing f1, so
> > that the ORDER of to-be-deleted-lines, relates to their contents?
> Probably trivial in awk as it's designed for text processing, certainly no
> easier in any other tool/language. Post an example (input plus expected
> output) if you'd like to see a solution.

No, I meant eg. f1 =
mable
table
able
needs to use 3 delete-loops, but if it was optimised, to
able
then it would delete all 3 lines in one-loop.

But this thinking REALLY demonstrates that/how premature
optimisation is absurd, since awk is so fast already.

Thanks.

== Chris Glur.

Kaz Kylheku

Sep 10, 2012, 5:46:07 PM
On 2012-09-10, no.to...@gmail.com <no.to...@gmail.com> wrote:
> In article <201209101...@kylheku.com>, Kaz Kylheku <k...@kylheku.com> wrote:
>> On 2012-09-10, no.to...@gmail.com <no.to...@gmail.com> wrote:
>> > We need these heuristics to trim out the repetative garbage
>> > that even the minimalist lynx/links fetches from http-pages.
>> Note that f1a can be mechanically translated to a bunch of sed commands
>> to delete lines.
> Yes manually.
> But mechanically requires a <compiler construction first>.

What?

> So when you fetch a sequence of articles from a http-site,
> via lynx/links and append them to a file, to make a book

lynx for web scraping? That is silly; you can use wget or curl,
and then pipe the output straight to your text processing utilities.

> of N chapers [where, here each char represents a line],
> when your fetch&Append file gives you:
> abc123|abc456|abc789|abc....
> awk should filter this to:
> abc123|456|789|....

I made a language that is ideal for scraping complicated web pages
with semi-regular structure: http://www.nongnu.org/txr

What is the URL of this site that you're scraping? And do you have an
example of what you want to scrape out and roughly in what format?

> It's staggering what they do with awk.
> But it's a dog to learn.

Hardly, just to use.

If the problem involves neat data which can be delimited into records that
contain fields, awk shines. All of that delimiting and looping is implicit in
awk's behavior and so the code that you write expresses only what happens with
the fields in those records.

The more the problem deviates from that structure, the less suited it
is for awk. Especially if you need to do a proper "software engineering" job
of it, which means handling and diagnosing all nonconforming inputs.

no.to...@gmail.com

Sep 10, 2012, 5:51:19 PM
In article <201209102...@webuse.net>, nobody wrote:

> Kaz Kylheku <k...@kylheku.com> wrote:
> > Note that f1a can be mechanically translated to a bunch of sed commands
> > to delete lines.
> >
> > /REGEX1/d
> > /REGEX2/d
> > /REGEX3/d
>
I think Ed said:-
> Just be careful how you handle any "/"s or other "special" characters in
> your input file during that translation process if you decide to go that
> route. Not sure what other characters (or situations) would be "special"
> to sed, but I expect, for example, that a "\" at the end of an input line
> would mess you up since you could end up with "/REGEX\/d" after
> translation if you don't handle it and there may be more...
>
Can you do the Regex for match ignoring space & tab and digits?

Dave Gibson wrote:-
grep -v -f f1 -- f2 > f3

Here's the results:
-> ls -l f* ==
-rw-r--r-- 1 root root 279 2012-09-10 20:23 f1
-rw-r--r-- 1 root root 7719 2012-09-10 20:22 f2
-rw-r--r-- 1 root root 7719 2012-09-10 23:15 f3
-rw-r--r-- 1 root root 6946 2012-09-10 20:35 f31
-rw-r--r-- 1 root root 7413 2012-09-10 20:36 f32
of 3 versions:-
grep -v -f f1 -- f2 > f3

awk 'NR==FNR{del[$0]; next} !($0 in del)' f1 f2 > f31

awk 'NR==FNR{del[$0]; next} \
{for (d in del) if ($0 ~ d) next}1' f1 f2 > f32

I can't see any problem with f31
----------
We need knowledge more than code-to-test.

== Chris Glur.




Chris Davies

Sep 10, 2012, 5:21:37 PM
no.to...@gmail.com wrote:
>> Note that f1a can be mechanically translated to a bunch of sed commands
>> to delete lines.
> Yes manually.
> But mechanically requires a <compiler construction first>.

You can use sed to provide the mechanism.

sed 's!\(.*\)!/\1/d!'

But on balance I think I prefer the grep solution posted elsewhere in
the thread.

Chris

no.to...@gmail.com

Sep 10, 2012, 6:49:52 PM
In article <201209102...@webuse.net>, "Ed Morton" <morto...@gmail.com> wrote:

> Dave Gibson <dave.gma...@googlemail.com.invalid> wrote:
>
> From: "Ed Morton" <morto...@gmail.com>
> > grep -v -f f1 -- f2 > f3
>
> Just be careful that you want to delete from f2 lines that CONTAIN the
> lines listed in f1 rather than the f2 lines that completely match the
> lines listed in f1.

No! Apparently MY specification was WRONG.
Natural english is really crap.
The intention is to delete whole lines.

It's too-much to expect to just delete part of lines,
but it's ........
---------------
I'm not going to delete the above to hide my confusion.
But I'll try to make the excuse that this is just a side-kick of
other problems, that I'm working on.

You've read the purpose of the utility:
to delete redundant lines.

That means whole lines.
But because the lines are derived from lynx/links,
often the <URL-indices-in-square-brackets>
vary. As per my example given.

Previously when I used the series-of-sed-method,
I selected <key strings in the lines>.

Now if I read my initial spec, of this thread, the
intention was to FIRST match full-lines, and then
refine it to Regex matching part of the line.

But perhaps that's not a refinement but requires
a completely different approach?

Please consider the TOP aim of this interesting problem,
where my proposed algorithm/description may be improved
on to reach the described final goal.

For the one application that I described, where you get
multiple web-pages, each having the same envelope/blurb,
from a site; I would just copy the first redundant-blurb to f1
to be used to clean f2: the later pages, all appended together.

But if a 'page' used a different leading-tab/space-count,
lines would escape being deleted. So the refinement of
regex-matching would be good. And even essential
for the lynx-URL-indices typefile.

Perhaps this problem belongs to the horse-and-rider
family, where you get visual feed-back and guide it?

Thanks,

== Chris Glur.

Ed Morton

Sep 10, 2012, 7:01:00 PM
On 9/10/2012 4:05 PM, no.to...@gmail.com wrote:
> Ed Morton's post was on google but not yet on eternal-september
> so, I'll patch in the goog version , without the <thread link>
>
>>> How would `awk` or other method do:-
>>> f1 : FileOfLinesToDelete
>>> from
>>> f2 : DataFileToBeCleanedUpByDeletingLines.
>
>> Assuming f1 contains text that appears as lines in f2 rather than line
>> numbers from f2:
> yes
>> awk 'NR==FNR{del[$0]; next} !($0 in del)' f1 f2
>>
> That's good! Except *I* can't decode it.

RTFM? It's pretty simple stuff given a couple of basic concepts.

>> > And the next refinement, is that f1A contains RegexS on each line,
>> > rather than <perfect matches>.
>>
>> awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2
>
> Nor can I decode this.

It's just a condition with an action and then a loop with a condition. Again,
very simple stuff.

> Which didn't delete lines like:-
> [49]Print | [50]Individual message | [51]Show original | [52]Report
> although it DID delete lines like:-
> this message | [53]Find messages by this author

Output without the input that produced it isn't very useful. If the above lines
were/weren't deleted then your first input file simply didn't/did match those
lines in an RE comparison.

> So an example of a Regex would be: for matching,
> ignore white-chars and <digits between square brackets>

OK, that's a totally different question to the one you had asked. You're no
longer asking "how do I compare lines in f2 against a file of REs?" you're now
asking "how do I construct REs that will ignore white-space and digits between
square brackets?" which in itself is an unclear question. Again, some sample
input and the expected output for that input would help but given the little bit
of output you show above this might give you an idea on what an RE might look
like in f1 to delete a line that contains " [51]Show original " at the start or
end of a line or between "|"s:

(^|[|])[[:blank:]]*[[][[:digit:]]+[]]Show[[:blank:]]+original[[:blank:]]*([|]|$)
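
To try it, the RE goes on a line of its own in f1, and the second one-liner from earlier in the thread does the rest, since it already treats every f1 line as an RE. A sketch with invented f2 contents; it assumes an awk that understands POSIX character classes such as [[:blank:]] (gawk does; some old mawk versions do not):

```sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# f1: one regex per line (here just the single RE shown above).
cat > f1 <<'EOF'
(^|[|])[[:blank:]]*[[][[:digit:]]+[]]Show[[:blank:]]+original[[:blank:]]*([|]|$)
EOF

# f2: hypothetical data to be cleaned.
cat > f2 <<'EOF'
[49]Print | [50]Individual message | [51]Show original | [52]Report
Show original research here
[7]Show   original
EOF

awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2 > f3
cat f3
# Show original research here
```

The first and third f2 lines go regardless of which number sits in the brackets; the middle line stays because it has no bracketed number in front of "Show".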

>> > What about later optimisation/refinement by processing f1, so
>> > that the ORDER of to-be-deleted-lines, relates to their contents?
>> Probably trivial in awk as it's designed for text processing, certainly no
>> easier in any other tool/language. Post an example (input plus expected
>> output) if you'd like to see a solution.
>
> No, I meant eg. f1 =
> mable
> table
> able
> needs to use 3 delete-loops, but if it was optimised, to
> able
> then it would delete all 3 lines in one-loop.
>
> But this thinking REALLY demonstrates that/how premature
> optimisation is absurd, since awk is so fast already.

Now I'm lost. Are you asking how to do something in awk or making a statement
about the way your input file is structured or something else? Honestly, for
more help I think you really need to post small samples of your 2 input files
along with the expected output.

Ed.

>
> Thanks.
>
> == Chris Glur.
>


Ed Morton

Sep 10, 2012, 8:40:46 PM
On 9/10/2012 4:21 PM, Chris Davies wrote:
> no.to...@gmail.com wrote:
>>> Note that f1a can be mechanically translated to a bunch of sed commands
>>> to delete lines.
>> Yes manually.
>> But mechanically requires a <compiler construction first>.
>
> You can use sed to provide the mechanism.
>
> sed 's!\(.*\)!/\1/d!'

You'd need more than that, e.g.:

$ cat file
dir/file
$ sed 's!\(.*\)!/\1/d!' file
/dir/file/d

I don't think sed -f on a file containing the above output would do what you want.
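
A slightly more defensive translation escapes the slashes first. This is still only a sketch: it assumes the f1a lines are sed-style BREs, that any "/" in them is not already escaped, and that there are no empty lines or trailing backslashes. The function name, file names and sample lines are invented:

```sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'dir/file' '^Posted using' > f1a
printf '%s\n' 'keep me' 'see dir/file here' 'Posted using www.webuse.net' 'also keep' > f2

make_sed_script() {   # usage: make_sed_script REGEX_FILE
    # 1st command: put a backslash in front of every "/" in the pattern.
    # 2nd command: wrap the whole line as /pattern/d.
    # ":" is used as the s-command delimiter so "/" needs no escaping here.
    sed 's:/:\\/:g; s:.*:/&/d:' "$1"
}

make_sed_script f1a > scriptfile
cat scriptfile
# /dir\/file/d
# /^Posted using/d

sed -f scriptfile f2 > f3
cat f3
# keep me
# also keep
```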

> But on balance I think I prefer the grep solution posted elsewhere in
> the thread.

It's better than trying to use sed but has its own issues (see else-thread) and
is not extensible as the OP seems to want.

Ed.

Ed Morton

Sep 10, 2012, 8:55:09 PM
On 9/10/2012 5:49 PM, no.to...@gmail.com wrote:
<snip>
> You've read the purpose of the utility:
> to delete redundant lines.

Hang on, I may be seeing the light. Are you saying that you just want to take a
file and remove all lines where the key information already occurred from a
previously seen line?

If that's so and given what you said else-thread about wanting to ignore white
space and numbers between square brackets, maybe this is what you want:

$ cat file
[123] foo bar
[432] this stuff
[987] foo bar
[1462872] this stuff here
[1462872] this stuff

$ cat tst.awk
{
    key = $0
    gsub(/([[][[:digit:]]+[]]|[[:space:]]+)/," ",key)
}
!seen[key]++

$ awk -f tst.awk file
[123] foo bar
[432] this stuff
[1462872] this stuff here

Regards,

Ed.

Kaz Kylheku

Sep 11, 2012, 12:02:38 AM
["Followup-To:" header set to comp.lang.awk.]
On 2012-09-11, Ed Morton <morto...@gmail.com> wrote:
> On 9/10/2012 4:21 PM, Chris Davies wrote:
>> no.to...@gmail.com wrote:
>>>> Note that f1a can be mechanically translated to a bunch of sed commands
>>>> to delete lines.
>>> Yes manually.
>>> But mechanically requires a <compiler construction first>.
>>
>> You can use sed to provide the mechanism.
>>
>> sed 's!\(.*\)!/\1/d!'
>
> You'd need more than that, e.g.:

Not necessarily. It depends on how the regex language is specified which
is stored in that file. That it contains regexes is already a given.

Someone had to produce the regexes and that someone could be aware that
it's going into sed.

> $ cat file
> dir/file
> $ sed 's!\(.*\)!/\1/d!' file
> /dir/file/d
>
> I don't think sed -f on a file containing the above output would do what you want.
>
>> But on balance I think I prefer the grep solution posted elsewhere in
>> the thread.
>
> It's better than trying to use sed but has it's own issues (see else-thread) and

Including the same issue. If you don't trust the regex language in that
file to be entirely compatible with grep, then you have to filter it.

Ed Morton

Sep 11, 2012, 1:16:31 AM
On 9/10/2012 11:02 PM, Kaz Kylheku wrote:
> ["Followup-To:" header set to comp.lang.awk.]
> On 2012-09-11, Ed Morton <morto...@gmail.com> wrote:
>> On 9/10/2012 4:21 PM, Chris Davies wrote:
>>> no.to...@gmail.com wrote:
>>>>> Note that f1a can be mechanically translated to a bunch of sed commands
>>>>> to delete lines.
>>>> Yes manually.
>>>> But mechanically requires a <compiler construction first>.
>>>
>>> You can use sed to provide the mechanism.
>>>
>>> sed 's!\(.*\)!/\1/d!'
>>
>> You'd need more than that, e.g.:
>
> Not necessarily. It depends on how the regex language is specified which
> is stored in that file. That it contains regexes is already a given.

That was an extension the OP wanted to consider. In his initial request he
wanted a list of text strings in his first file that'd exactly match the lines
in his second file. There were no regexes in the first file in that scenario.

> Someone had to produce the regexes and that someone could be aware that
> it's going into sed.

Sure but it's all getting a bit more deck-of-cards than necessary at that point.
Anyway, I actually think we're all getting thrown off track by the OP describing
his idea of a solution to the problem rather than just describing the problem.

Ed.

no.to...@gmail.com

Sep 11, 2012, 4:47:08 AM
In article <k2lrfc$ttl$1...@dont-email.me>, nobody wrote:

> On 9/10/2012 4:05 PM, no.to...@gmail.com wrote:
> > Ed Morton's post was on google but not yet on eternal-september
> > so, I'll patch in the goog version , without the <thread link>
> >
> >>> How would `awk` or other method do:-
> >>> f1 : FileOfLinesToDelete
> >>> from
> >>> f2 : DataFileToBeCleanedUpByDeletingLines.
> >
> >> Assuming f1 contains text that appears as lines in f2 rather than line
> >> numbers from f2:
> > yes
> >> awk 'NR==FNR{del[$0]; next} !($0 in del)' f1 f2
> >>
> > That's good! Except *I* can't decode it.

]RTFM? It's pretty simple stuff given a couple of basic concepts.
-------- tests:--
awk 'NR==FNR{print FNR}' f1 f2 <-- prints 1...5 == f1's line numbers

awk 'NR==FNR{print NR}' f1 f2 <-- same output

awk 'NR==FNR{print $NR}' f1 f2 == outputs 4 lines:
[49]Print
message
[60]Individual
[63]Find

of partial contents of f1's 5 line contents:--
[49]Print | [50]Individual message | [51]Show original | [52]Report
this message | [53]Find messages by this author
[59]Print | [60]Individual message | [61]Show original | [62]Report
this message | [63]Find messages by this author
Posted using [66]www.webuse.net

That does NOT seem right ?!
-------- end of tests:-

>> awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2
>
> Nor can I decode this.

]It's just a condition
NR==FNR :: WHILE NR==FNR

]with an action
{del[$0]; next} :: construct the arrayOfMatchingLines

]and then a loop with a condition.
----------------
> So an example of a Regex would be: for matching,
> ignore white-chars and <digits between square brackets>

]OK, that's a totally different question to the one you had asked.

]You're no longer asking "how do I compare lines in f2 against a file of REs?"
Yes. = Question A

]you're now asking "how do I construct REs that will ignore white-space
]and digits between square brackets?" which in itself is an unclear question.
Yes. = Question B

For this, refined version of the original filter, answers to both questions
are required.

]Again, some sample input and the expected output for that input
]would help
f2 = <your paste of this post>
f1 = 3 lines:-
the cat sat on the mat
URL link number [43] continue
would help
So the 3rd line of f1, appears twice in f2, and both should be
deleted from the output [in the simplified version]. Which I believe
your 1st one-liner would do.

]but given the little bit of output you show above this might give you
]an idea on what an RE might look like in f1 to delete a line that contains
]" [51]Show original " at the start or end of a line or between "|"s:

](^|[|])[[:blank:]]*[[][[:digit:]]+[]]Show[[:blank:]]+original[[:blank:]]*([|]|$)

Thanks, I'll test that in some kind of <sed harness>. But to use it, I need
the answer to the OTHER question: <how to write the awk to use the Regex>.
-------------- snip
]Now I'm lost. Are you asking how to do something in awk or making a statement
]about the way your input file is structured or something else? Honestly, for
]more help I think you really need to post small samples of your 2 input files
]along with the expected output.

Well I won't say RTFM, but if you consider my original algorithm-hint
of 2 nested loops:
if f1 was 'mechanically optimised' to the single line of "able",
then it would need to loop only once instead of thrice,
since "able" matches all three of: "mable", "table", "able".
But let's not go there.
---------- snip
> But on balance I think I prefer the grep solution posted elsewhere in
> the thread.

]It's better than trying to use sed but has it's own issues (see else-thread) and
]is not extensible as the OP seems to want.

Because the end-product is human-read-text, which is not an
exact science, it's probably best to use an incremental improvement
approach. Besides 'successive refinement' is almost always a good
approach to software development.
------------ snip
> Not necessarily. It depends on how the regex language is specified which
> is stored in that file. That it contains regexes is already a given.

]That was an extension the OP wanted to consider. In his initial request he
]wanted a list of text strings in his first file that'd exactly match the lines
]in his second file. There were no regexes in the first file in that scenario.

Correct. And I hoped/assumed 'refinement' would give a more powerful
version. But the simplified version tests OK, so far. It's having to do a refresher
course on `man awk`, that seems so wasteful. OTOH it's understandable that
the one-line-hip-shooting-awk-jockeys don't WANT others to easily know how
to use awk. If you examine the use of awk in the *nix scripts, it's only used
to <give me the Nth white-char-separated-field>.
Which proves that the old timers have worked out that unless you're a
dedicated awk user, it's not worth the investment to use its capabilities.

> Someone had to produce the regexes and that someone could be aware that
> it's going into sed.

]Sure but it's all getting a bit more deck-of-cards than necessary at that point.
]Anyway, I actually think we're all getting thrown off track by the OP describing
]his idea of a solution to the problem rather than just describing the problem.

I gave the top-level-spec as <deleting duplicate lines in a file of appended
http-fetches, all from the same 'web-site'>. And ADDITIONALLY I gave a
pseudo-code representation of a possible algorithm.

Thanks,

== Chris Glur.




Ed Morton

Sep 11, 2012, 8:52:15 AM
On 9/11/2012 3:47 AM, no.to...@gmail.com wrote:
> In article <k2lrfc$ttl$1...@dont-email.me>, nobody wrote:
>
>> On 9/10/2012 4:05 PM, no.to...@gmail.com wrote:
>> > Ed Morton's post was on google but not yet on eternal-september
>> > so, I'll patch in the goog version , without the <thread link>
>> >
>> >>> How would `awk` or other method do:-
>> >>> f1 : FileOfLinesToDelete
>> >>> from
>> >>> f2 : DataFileToBeCleanedUpByDeletingLines.
>> >
>> >> Assuming f1 contains text that appears as lines in f2 rather than line
>> >> numbers from f2:
>> > yes
>> >> awk 'NR==FNR{del[$0]; next} !($0 in del)' f1 f2
>> >>
>> > That's good! Except *I* can't decode it.
>
> ]RTFM? It's pretty simple stuff given a couple of basic concepts.
> -------- tests:--
> awk 'NR==FNR{print FNR}' f1 f2 <-- prints 1...5 == f1's line numbers

print the number of lines read so far in just the currently open file

> awk 'NR==FNR{print NR}' f1 f2 <-- same out put

print the number of lines read so far across all files

> awk 'NR==FNR{print $NR}' f1 f2 == outputs 4 lines:

print the field in the currently open file indexed by the current line number
across all files. This doesn't really make sense to do.

> [49]Print

Field 1 in Line 1 of f1 below

> message

Field 2 in Line 2 of f1 below

> [60]Individual

Field 3 in Line 3 of f1 below

> [63]Find

Field 4 in Line 4 of f1 below

You would have seen a blank line printed at the end of the above since you'd be
telling awk to print Field 5 in Line 5 of f1 below but line 5 only has 3 fields
so Field 5 would be the empty string.

> of partial contents of f1's 5 line contents:--
> [49]Print | [50]Individual message | [51]Show original | [52]Report
> this message | [53]Find messages by this author
> [59]Print | [60]Individual message | [61]Show original | [62]Report
> this message | [63]Find messages by this author
> Posted using [66]www.webuse.net
>
> That does NOT seem right ?!

It is right, see above. NR is the number of records (lines, if the
record separator is a newline char) read so far across all files. FNR is the
same thing but only across the currently open file. FNR and NR have the same
value in the first file only, assuming that file is not empty, see the
pseudo-code I added below.

> -------- end of tests:-
>
> >> awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2
> >
> > Nor can I decode this.
>
> ]It's just a condition
> NR==FNR :: WHILE NR==FNR

If you want pseudo-code, what awk is doing is more like:

NR==FNR { <perform an action> } ::

NR=0
FOR file in files
DO
    FNR=0
    WHILE read $0
    DO
        split $0 into $1, $2, $3, etc.
        NR++
        FNR++
        IF (NR==FNR) <<<<<< Your written condition goes here
        THEN
            <perform an action> <<<<<< Your associated action goes here
        ENDIF
    ENDWHILE
ENDFOR
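
The two counters can also be watched directly. A small sketch with two throwaway files (the file contents and the show_counters name are made up for the demonstration):

```sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' a b > f1
printf '%s\n' x y z > f2

show_counters() {   # usage: show_counters FILE...
    # For every input line print the file name, NR, FNR, and whether
    # NR == FNR, which is only true while the first file is being read.
    awk '{ print FILENAME, NR, FNR, (NR == FNR ? "first" : "later") }' "$@"
}

show_counters f1 f2
# f1 1 1 first
# f1 2 2 first
# f2 3 1 later
# f2 4 2 later
# f2 5 3 later
```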

>
> ]with an action
> {del[$0]; next} :: construct the arrayOfMatchingLines

Right.
Many seds only support BREs, the above is an ERE as supported by all awks.

> But to use it, I need
> the answer to the OTHER question: <how to write the awk to use the Regex>.
> -------------- snip
> ]Now I'm lost. Are you asking how to do something in awk or making a statement
> ]about the way your input file is structured or something else? Honestly, for
> ]more help I think you really need to post small samples of your 2 input files
> ]along with the expected output.
>
> Well I won't say RTFM, but if you consider my original algorithm-hint
> of 2 nested loops:
> if f1 was 'mechanically optimised' to the single line of "able",
> then it would need to loop only once instead of thrice,
> since "able" matches all three of: "mable", "table", "able".
> But let's not go there.

Right because then "able" would match on a slew of words you DON'T want matched,
like "fable", "cable", etc. so you can't do that.
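
To illustrate the point, a small sketch using the regex one-liner quoted above. The word list and the anchored pattern ^(m|t)?able$ are illustrative assumptions, not part of the original posts:

```shell
cd "$(mktemp -d)" || exit 1
printf 'mable\ntable\nable\nfable\ncable\n' > f2

# Unanchored: "able" matches every line, wanted or not.
printf 'able\n' > f1
loose=$(awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2)

# Anchored ERE: only the three intended words match.
printf '^(m|t)?able$\n' > f1
strict=$(awk 'NR==FNR{del[$0]; next} {for (d in del) if ($0 ~ d) next}1' f1 f2)

printf 'loose:  [%s]\nstrict: [%s]\n' "$loose" "$strict"
```

With the loose pattern nothing is left; with the anchored one, "fable" and "cable" are kept.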

> ---------- snip
>> But on balance I think I prefer the grep solution posted elsewhere in
>> the thread.
>
> ]It's better than trying to use sed but has it's own issues (see else-thread) and
> ]is not extensible as the OP seems to want.
>
> Because the end-product is human-read-text, which is not an
> exact science, it's probably best to use an incremental improvement
> approach. Besides 'successive refinement' is almost always a good
> approach to software development.

That's often true in coding but not in architecture/design. To come up with a
good architecture/design you need to understand the current requirements
thoroughly AND have a good idea of what enhancements will be required in future.
If you don't do that then you end up with the SW-architectural equivalent of a
thatched cottage sticking out half way up the Eiffel Tower. Right now we're at
the architecture/design stage as we're trying to figure out really high level
stuff like "do we parse 1 file or 2?"

> ------------ snip
>> Not necessarily. It depends on how the regex language is specified which
>> is stored in that file. That it contains regexes is already a given.
>
> ]That was an extension the OP wanted to consider. In his initial request he
> ]wanted a list of text strings in his first file that'd exactly match the lines
> ]in his second file. There were no regexes in the first file in that scenario.
>
> Correct. And I hoped/assumed 'refinement' would give a more powerfull
> version. But the simplified version test OK, so far. It's having to do a
> refresher
> course on `man awk`, that seems so wasteful. OTOH it's understandable that
> the one-line-hip-shooting-awk-jokeys don't WANT others to easily know how
> to use awk. If you examine the use of awk in the *nix scripts, it's only used
> to <give me the Nth white-char-separated-field>.
> Which proves that the old timers have worked out that unless you're a
> dedicated awk user, it's not worth the investment to use it's capabilities.

The above may be how things are where you work but you're way off in general.
That's OT for this forum anyway, though.

>> Someone had to produce the regexes and that someone could be aware that
>> it's going into sed.
>
> ]Sure but it's all getting a bit more deck-of-cards than necessary at that point.
> ]Anyway, I actually think we're all getting thrown off track by the OP describing
> ]his idea of a solution to the problem rather than just describing the problem.
>
> I gave the top-level-spec as <deleting duplicate lines in a file of appended
> http-fetches, all from the same 'web-site'>. And ADDITIONALLY I gave a
> pseudo-code representation of a possible algorithm.

The thing I'm struggling with is trying to understand what the purpose of having
2 files is given what you say you want to do.

If you just want to delete duplicates in a file ignoring white space and digits
between square brackets as you say above, then that's just a matter of
constructing a key from the important parts of each line and skipping those
lines where the constructed key has appeared before, e.g.:

$ cat file
[123] foo bar
[432] this stuff
[987] foo bar
[1462872] this stuff here
[1462872] this stuff

$ cat tst.awk
{
    key = $0
    gsub(/([[][[:digit:]]+[]]|[[:space:]]+)/," ",key)
}
!seen[key]++

$ awk -f tst.awk file
[123] foo bar
[432] this stuff
[1462872] this stuff here

If that isn't what you want then please just provide a more representative,
small (~10 lines or so) input file and the expected output and explain the mapping.

Ed.
>
> Thanks,
>
> == Chris Glur.
>
>
>
>


no.to...@gmail.com

Sep 11, 2012, 2:59:28 PM
In article <k2nc60$p5m$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:

> On 9/11/2012 3:47 AM, no.to...@gmail.com wrote:> In article
> <k2lrfc$ttl$1...@dont-email.me>, nobody wrote:
--snip
> The above may be how things are where you work but you're way off in general.
> That's OT for this forum anyway, though.

I'm REtired, but I'm not yet tired.
---
> The thing I'm struggling with is trying to understand what the
> purpose of having 2 files is given what you say you want to do.

There are 4 files: 1=input-text-data; 2=script;
3= <args> which are applicable to the input; 4=output

For the first 'family of tasks' (not those with "[123]")
where eg. a 44-line header appears, repeated 10 times, for the
10 web-pages, from the same site, which have been appended to
the input-text-data file,
I just need to copy the 1st repeating-header-44-lines to the <arg> file,
and say `DoIt`;
which will call the script, to use the <arg> file, to delete the 10 blocks.
And then copying the <arg> file, into the output-file, at the appropriate
position, gives the 10 chapters of the book, with the 44-line-blurb,
in the 1st chapter ONCE only.

So with this family of data-files, you don't need to manually key in the
args. You just paste/block-copy them to the arg-file.

It looks as if your first script does this OK.
I've tested it on a 150KB text-file, which I'll read later, in detail.
----------------
> If you just want to delete duplicates in a file ignoring white space and digits
> between square brackets as you say above,

yes, but only those lines which I've selected for the <delete-arg> file.

> then that's just a matter of
> constructing a key from the important parts of each line and skipping those
> lines where the constructed key has appeared before, e.g.:
>
> $ cat file
> [123] foo bar
> [432] this stuff
> [987] foo bar
> [1462872] this stuff here
> [1462872] this stuff
>
> $ cat tst.awk
> {
> key = $0
> gsub(/([[][[:digit:]]+[]]|[[:space:]]+)/," ",key)
> }
> !seen[key]++
>
> $ awk -f tst.awk file
> [123] foo bar
> [432] this stuff
> [1462872] this stuff here
>
Without trying to understand your code, this works, for your interpretation;
but my example line/s didn't all START with "["<digit>.
So eg. if I use :
del [123] foo bar
[432] this stuff
del [987] foo bar
then it fails.

> If that isn't what you want then please just provide a more representative,
> small (~10 lines or so) input file and the expected output and explain the mapping.

No. I've got a private war against <explaining by examples> which is as bad as
instructions like: klikA, klikB, klikC...

Because this is a non deterministic task, I propose the horse-and-rider
method; where you get visual feedback from the OS and you guide it.

So the key aspect that I failed to explain, is that the user/me looks
at the input-file and sees 'these N-consecutive lines' are repeated
garbage; so perhaps keep the first copy, but delete all the others.

this heuristic is possible because webpages are made by boys
from the throw-away magazine industry, and they just use templates
which they populate with the minimum of valuable info.

If you look at the emails these days? A 2 line contents is packaged
in 200-lines of garbage.

> Ed.
> >
> > Thanks,
> >
> > == Chris Glur.
I'll study your MUCH VALUED pseudo-code later.


Ed Morton

Sep 11, 2012, 3:25:28 PM
In what way does it fail? There's nothing in my script that assumes your lines
will start with "["<digit>.
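
For what it's worth, here is a sketch of the same key-construction idea run on the three "del" lines quoted above. It uses [0-9] and [ \t] instead of the POSIX character classes, purely as an assumption to stay portable to older awks that may lack them:

```shell
cd "$(mktemp -d)" || exit 1
printf 'del [123] foo bar\n[432] this stuff\ndel [987] foo bar\n' > file

out=$(awk '
    {
        key = $0
        # Collapse each "[digits]" and each run of blanks to one space.
        gsub(/\[[0-9]+\]|[ \t]+/, " ", key)
    }
    !seen[key]++      # print only the first line that produces a given key
' file)
echo "$out"
```

The second "del ... foo bar" line produces the same key as the first and is dropped, wherever the "[digits]" part sits in the line.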

>
> > If that isn't what you want then please just provide a more representative,
> > small (~10 lines or so) input file and the expected output and explain the
> > mapping.
>
> No. I've got a private war against <explaining by examples> which is a bad as
> instructions like: klikA, klikB, klikC...

I'm not asking for examples that I could synthesize your requirements from, just
a Use Case to help clarify your textual requirements. Use Cases have proven
tremendously useful over the years in requirements specification and I think
they'd be of ENORMOUS benefit here.

>
> Because this is a non deterministic task, I propose the horse-and-rider
> method; where you get visual feedback from the OS and you guide it.
>
> So the key aspect that I failed to explain, is that the user/me looks
> at the input-file and sees 'these N-consecutive lines' are repeated
> garbage; so perhaps keep the first copy, but delete all the others.
>
> this heuristic is possible because webpages are made by boys
> from the throw-away magazine industry, and they just use templates
> which they populate with the minimum of valuable info.
>
> If you look at the emails these days? A 2 line contents is packaged
> in 200-lines of garbage.
>
> > Ed.
> > >
> > > Thanks,
> > >
> > > == Chris Glur.
> I'll study your MUCH VALUED pseudo-code later.
>

The good news is that I'm sure your problem has a trivial solution, the bad news
is I can't seem to fully grasp what that problem is! So, I do hope there's
something in what I've posted so far in the thread that helps, but without some
concrete example to help me better understand what you're trying to do, I'm
afraid I need to bow out and leave it to others (maybe someone who understands
your problem domain if not your specific problem) to help you further.

All the best,

Ed.


Posted using www.webuse.net

no.to...@gmail.com

Sep 12, 2012, 4:13:58 PM
In article <k2nc60$p5m$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:

> On 9/11/2012 3:47 AM, no.to...@gmail.com wrote:> In article
> <k2lrfc$ttl$1...@dont-email.me>, nobody wrote:

Ed Morton's first suggestion:-

awk 'NR==FNR{del[$0]; next} !($0 in del)'\
TextBloksDelete Input > Output

tests well, on real data files.
---
> The thing I'm struggling with is trying to understand what the
> purpose of having 2 files is given what you say you want to do.

Here's a fuller explanation. (The "["digits"]" is a different case.)
Understandably users with fast, free inet connections, using a
normal full-featured browser, who see text in between floating
pictures, like in a throw away paper magazine, may not realise
that repeat publishers have a fixed format, like they learned for
paper publishing.

They put their logos and adverts at the top, before the real
info that you may want to capture and keep.
And they might spread the info over 10 pages: again like paper
magazines: using multiple issues.

To accumulate the text of a sequence of http-pages and discard
the packaging-garbage, lynx/links/elinks are suitable fast
text-only fetchers. And a 2 argument script does:
AppendURLsToBook URLs <BookTitle>

The book will then look like this:
Ha|Hb|Hc|...Hn|

Where:
H = the redundantly repeated <packaging> of each fetched page;
a, b, c...n = the sequence of 'articles' containing the info/knowledge.
| = a suitable <chapter divider>. I use a "<><>..<>" line and the URL

The purpose of the awk-one-liner is to delete the redundantly
repeated Hs.

The idea is that after you've read the 2nd fetched page, you
recognise the redundantly repeated 'H' block-of-text.
So then you can just wipe/mark it and 'say'
<delete all further copies of this>.

Unfortunately there's a potentially dangerous flaw in my
original specification. Since `awk` is line-based, I thought
of <delete any line in the header which is repeated>.

It should be <delete any repeated WHOLE-BLOCK:H>
It can be used repeatedly if there's an H1 and a H2...
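
The flaw can be seen in a small sketch (the file names H and Input and their contents are made up for illustration): if the header block contains a blank line, the line-based one-liner removes every blank line in the article text too.

```shell
cd "$(mktemp -d)" || exit 1
printf 'LOGO\n\n'                       > H       # header block: "LOGO" plus a blank line
printf 'LOGO\n\npara one\n\npara two\n' > Input   # header, then two paragraphs

out=$(awk 'NR==FNR{del[$0]; next} !($0 in del)' H Input)
echo "$out"
```

The blank line that separated the two paragraphs is lost along with the header.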

That sounds like a task not so suitable for a line-oriented
utility like `awk`?

I'd hate to miss the ease and power of *nix's line utilities,
and start grovelling down at char level.

= TIA.


Chick Tower

Sep 12, 2012, 4:39:08 PM
On 2012-09-11, no.to...@gmail.com <no.to...@gmail.com> wrote:
> In article <k2nc60$p5m$1...@dont-email.me>, Ed Morton wrote:
>> If that isn't what you want then please just provide a more representative,
>> small (~10 lines or so) input file and the expected output and explain the
>> mapping.
>
> No. I've got a private war against <explaining by examples> which is a bad as
> instructions like: klikA, klikB, klikC...
>
> Because this is a non deterministic task, I propose the horse-and-rider
> method; where you get visual feedback from the OS and you guide it.

You're an idiot. You bitch about the inadequacy of English to properly
describe your problem, but, for personal reasons, refuse to provide an
unambiguous example of your problem so others can help you solve it. Ed
even came up with a better solution than you expected, one that doesn't
require setting up a file of lines you wish to eliminate from a file of
copied web pages.

Since you seem willing to create these extra files, maybe you can tweak
the options you give the diff command to pare those unwanted lines out
of the output of the difference between your source file and your file of
unwanted lines. You would have to have one set of those unwanted lines
for each page in your source file, but you can just copy-and-paste them
as many times as you need. Perhaps there's even a modified version of
diff that will use the same set of lines over and over.

grep -v could do what you want, too, as suggested by others.

Here's another thought. Since you're willing to perform a task for each
of your source files, why not just edit the files and delete the lines
you don't want to see? Or, just skip over those lines when you read the
file, and not have any extra work to do?

I know you asked about awk, but I suspect you're more interested in a
solution than the use of awk for that solution.
--
Chick Tower

For e-mail: colm DOT sent DOT towerboy AT xoxy DOT net

Chick Tower

Sep 12, 2012, 4:43:28 PM
On 2012-09-11, no.to...@gmail.com <no.to...@gmail.com> wrote:
> In article <k2nc60$p5m$1...@dont-email.me>, Ed Morton wrote:
>> If that isn't what you want then please just provide a more representative,
>> small (~10 lines or so) input file and the expected output and explain the
>> mapping.
>
> No. I've got a private war against <explaining by examples> which is a bad as
> instructions like: klikA, klikB, klikC...
>
> Because this is a non deterministic task, I propose the horse-and-rider
> method; where you get visual feedback from the OS and you guide it.

Ed Morton

Sep 12, 2012, 4:49:35 PM
Awk is not line-based, it's record-based - see below.

> I thought of <delete any line in the header which is repeated>.
>
> It should be <delelte any repeated WHOLE-BLOCK:H>
> It can be used repeatedly if there's an H1 and a H2...
>
> That sounds like a task not so suitable for a line-oriented
> utility like `awk`?

I still won't be able to help you overall without a concrete example but wrt the
above statement: unlike sed, awk is NOT line-based it's record-based. Look:

$ cat file
hello world
this is
some text

and here is
more

followed
by a final
block of text

$ awk '{print "Record", NR, "= <" $0 ">"}' file
Record 1 = <hello world>
Record 2 = <this is>
Record 3 = <some text>
Record 4 = <>
Record 5 = <and here is>
Record 6 = <more>
Record 7 = <>
Record 8 = <followed>
Record 9 = <by a final>
Record 10 = <block of text>

$ awk -v RS="" '{print "Record", NR, "= <" $0 ">"}' file
Record 1 = <hello world
this is
some text>
Record 2 = <and here is
more>
Record 3 = <followed
by a final
block of text>

Note how by just setting the Record Separator (RS) to a value other than its
default (a newline) you can change awk's concept of a record from a single line
to a block of lines separated by something else (empty lines in this case).

Likewise you can change how each record is automatically broken into fields just
by changing the value of the Field Separator (FS) variable:

$ awk -v RS="" '{
    print "Record", NR, "= <" $0 ">"
    for (i=1;i<=NF;i++)
        print "\tField", i, "= ["$i"]"
}' file
Record 1 = <hello world
this is
some text>
        Field 1 = [hello]
        Field 2 = [world]
        Field 3 = [this]
        Field 4 = [is]
        Field 5 = [some]
        Field 6 = [text]
Record 2 = <and here is
more>
        Field 1 = [and]
        Field 2 = [here]
        Field 3 = [is]
        Field 4 = [more]
Record 3 = <followed
by a final
block of text>
        Field 1 = [followed]
        Field 2 = [by]
        Field 3 = [a]
        Field 4 = [final]
        Field 5 = [block]
        Field 6 = [of]
        Field 7 = [text]

$ awk -v RS="" -v FS="\n" '{
    print "Record", NR, "= <" $0 ">"
    for (i=1;i<=NF;i++)
        print "\tField", i, "= ["$i"]"
}' file
Record 1 = <hello world
this is
some text>
        Field 1 = [hello world]
        Field 2 = [this is]
        Field 3 = [some text]
Record 2 = <and here is
more>
        Field 1 = [and here is]
        Field 2 = [more]
Record 3 = <followed
by a final
block of text>
        Field 1 = [followed]
        Field 2 = [by a final]
        Field 3 = [block of text]

There's also xmlawk which understands XML and so might interest you if you're
parsing web pages.

With many awks the RS has to be a single character but with GNU awk you can set
RS to a Regular Expression just like you can for the FS, e.g.:

$ cat file
hello world
this is
some text
<end>
and here is
more
<END>
followed
by a final
block of text
<end>

$ awk -v RS='\n<(end|END)>\n' '{print "Record", NR, "= <" $0 ">"}' file
Record 1 = <hello world
this is
some text>
Record 2 = <and here is
more>
Record 3 = <followed
by a final
block of text>

Regards,

Ed.

>
> I'd hate to miss the ease and power of *nix's line utilities,
> and start grovelling down at char level.
>
> = TIA.
>


Posted using www.webuse.net

no.to...@gmail.com

Sep 13, 2012, 7:00:56 AM
In article <201209122...@webuse.net>, "Ed Morton" <morto...@gmail.com> wrote:

> no.to...@gmail.com <no.to...@gmail.com> wrote:
>
]> That sounds like a task not so suitable for a line-oriented
]> utility like `awk`?

]I still won't be able to help you overall without a concrete example but wrt the
]above statement: unlike sed, awk is NOT line-based it's record-based. Look:
--- snip examples filed for <awk tutor> for later repeated use----

Yes, considering that the clean-up-script will be run, based on what
the human-user sees as redundant-garbage, and copies to the DeleteFile,
in a single step; this could perhaps be followed by a script to:
1= append a suitable record-separator-token to the DeleteFile,
2= delete all-except-the-1st redundant-garbage-record from
InPutFile > OutPutFile.
DeletingAll and then restoring the 1st:H is also OK.
Or a keybrd-macro could put the record-separator at the end
of the garbage-record, before the editor contents is
<save to InputFile and call CleanUpScript>.
It's getting more absurd?
----
Chick Tower's pointing to `diff` may be a solution.
I'll have to rest & think.
But I'm waiting for someone to point out how dumb I am, since
this is a plain editor-search-and-replace task!

It's always good to try and see the BIG picture: "why didn't I see
that?" It's because we 'progress' like evolution does: from where
we ARE, to the next stage; and we don't abandon our previous work
and start anew.
That's M$'s secret: users were able to evolve from DOS.

Of course I would have been using search-and-replace, if my
editorS already had [80*60=] 4800 char search/replace buffers.
Introducing new editors, can be as disruptive as new languages.

Here's your "concrete example" file of 300 lines with line
numbers added. The actual file has NOT got line numbers.

InputFile ==
1=http://www.bbc.com/news/technology-19425051
2 = paste any
3 = 49 lines here; representing 'H' in my 'schematic'
...
50 = the last line of 'H'
51 ...99 paste any 49-line-block which is different to H
100 = <><><><>
101=http://www.bbc.com/news/technology-19391235
102 = paste the SAME [redundant duplicated]
103 = 49 lines here; representing 'H' in my 'schematic'
...
150 = the last line of 'H'
151 ...99 paste any 49-line-block which is different to H
200 = <><><><>
201=http://www.bbc.com/news/technology-19572820
202 = paste the SAME [redundant duplicated]
203 = 98 lines here; representing 'H' in my 'schematic'
...
250 = the last line of 'H'
251 ...99 paste any 49-line-block which is different to H
300 = <><><><>

That's the input file.
The DeleteFile is the 49 lines H file.

This is my first guessed test script:---

Create DeleteFile ## 49 line H:file
Create InputFile:-
Empty the test inputFile: InF
FOR i := 1 TO 9
echo 'http://www.bbc.com/news/technology-123' >> InF
cat DeleteFile >> InF
cat Arbitrary49lineNotHfile >> InF
echo '<><><><>' >> InF
END-FOR-LOOP

cat InF | wc -l ## expect InF to be 900 lines

<examine InF with editor and mark garbage-block
and run cleaner>

cat OutF | wc -l # expect OutF to be
(900 - 9*49 =900 - 450 + 9=459; I think) lines.
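
A runnable version of that test outline might look like the following sketch. The header and article lines are generated stand-ins (assumptions), and the cleaner is the line-based one-liner from earlier in the thread, which drops all 9 copies of H:

```shell
cd "$(mktemp -d)" || exit 1
# Stand-ins for the 49-line header block and a 49-line article body.
awk 'BEGIN { for (i = 1; i <= 49; i++) print "header line " i }'  > DeleteFile
awk 'BEGIN { for (i = 1; i <= 49; i++) print "article line " i }' > Arbitrary49lineNotHfile

: > InF                                   # empty the test input file
for i in 1 2 3 4 5 6 7 8 9; do
    echo 'http://www.bbc.com/news/technology-123' >> InF
    cat DeleteFile                                >> InF
    cat Arbitrary49lineNotHfile                   >> InF
    echo '<><><><>'                               >> InF
done

in_lines=$(awk 'END { print NR }' InF)
# Line-based cleaner from earlier in the thread: drops every copy of H.
awk 'NR==FNR{del[$0]; next} !($0 in del)' DeleteFile InF > OutF
out_lines=$(awk 'END { print NR }' OutF)
echo "InF: $in_lines lines, OutF: $out_lines lines"
```

Each of the 9 records is 1 + 49 + 49 + 1 = 100 lines, so InF has 900 lines; removing 9 x 49 = 441 header lines leaves 459.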

So the inputFile consists of N records,
each record having 4 sub-records:
1=URL; 2=repeatedHeader; 3=info; 4=separatorLine.

The outputFile, should have the repeatedHeader
deleted from all-but-the-1st record.
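
One possible way to do this in awk, offered as a sketch rather than a definitive solution: read the block from DeleteFile, hold InputFile in memory, and drop every complete, consecutive copy of the block after the first. It assumes DeleteFile is non-empty and that the input fits in memory; the sample data is made up.

```shell
cd "$(mktemp -d)" || exit 1
printf 'LOGO\nMENU\n' > DeleteFile
printf '%s\n' url1 LOGO MENU 'text a' '<><>' \
              url2 LOGO MENU 'text b' '<><>' \
              url3 LOGO 'MENU stays' '<><>' > InputFile

out=$(awk '
    NR == FNR { blk[++n] = $0; next }     # first file: the block H, line by line
    { line[++m] = $0 }                    # second file: hold every line in memory
    END {
        i = 1
        while (i <= m) {
            hit = 0
            if (i + n - 1 <= m) {         # enough lines left for a whole block
                k = 1
                # Appending "" forces string (not numeric) comparison.
                while (k <= n && (line[i + k - 1] "") == (blk[k] "")) k++
                hit = (k > n)
            }
            if (hit) {
                if (!seen++)              # keep only the first copy of the block
                    for (k = 1; k <= n; k++) print blk[k]
                i += n                    # skip past the block either way
            } else {
                print line[i]
                i++
            }
        }
    }
' DeleteFile InputFile)
echo "$out"
```

Unlike the line-based one-liner, a partial match (the third record's "LOGO" followed by "MENU stays") is left untouched, because only a whole block counts.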

== `awk` is brutal punishment.


Ed Morton

Sep 13, 2012, 9:21:28 AM
On 9/13/2012 6:00 AM, no.to...@gmail.com wrote:
> In article <201209122...@webuse.net>, "Ed Morton" <morto...@gmail.com> wrote:
>
>> no.to...@gmail.com <no.to...@gmail.com> wrote:
>>
> ]> That sounds like a task not so suitable for a line-oriented
> ]> utility like `awk`?
>
> ]I still won't be able to help you overall without a concrete example but wrt the
> ]above statement: unlike sed, awk is NOT line-based it's record-based. Look:
> --- snip examples filed for <awk tutor> for later repeated use----
>
> Yes, considering that the clean-up-script will be run, based on what
> the human-user sees as redundant-garbage, and copies to the DeleteFile,
> in a single step; this could perhaps be followed by a script to:
> 1= append a suitable record-separator-token to the DeleteFile,
> 2= delete all-except-the-1st redundant-garbage-record from
> InPutFile > OutPutFile.
> DeletingAll and then restoring the 1st:H is also OK.
> Or a keybrd-macro could put the record-separator at the end
> of the garbage-record, before the editor contents is
> <save to InputFile and call CleanUpScript>.
> It's getting more absurd?

Again, you're describing what you think is a solution to your problem rather
than describing your problem.

<snip>
No, that's some kind of framework for an input file with vague instructions for
how we might construct an input file from it by editing it and adding lines. Why
on earth would you think we'd waste our time doing that instead of you just
posting an input file when you're the one asking for help? And you do NOT need a
300 line input file to demonstrate your problem - you CAN do it in 20 lines or
less, I promise. If you have some block of 49 lines that have some relevance in
a 300 line file - just shrink the numbers and show us the problem using a 3 line
block in a 20 line file.

<snip>
> == `awk` is brutal punishment.

Can't imagine where that came from. Call me cynical but I'm starting to wonder
if the whole point of this thread was to try to demonstrate that no-one could
come up with an awk solution to your problem rather than to actually solve your
problem. I can't think of any other reason why you're being so evasive about
providing a concrete example of your problem.

Ed.

Dave Gibson

Sep 13, 2012, 1:47:42 PM
[ Followup-To: set to comp.os.linux.misc ]

In comp.os.linux.misc, no.to...@gmail.com wrote:

[ processing a display-formatted (by lynx) dump of a web page ]

> But because the lines are derived from lynx/links,
> often the <URL-indices-in-square-brackets>
> vary. As per my example given.

Start by using lynx's -nolist option.

lynx -dump -nolist http://web/page...

Chick Tower

Sep 13, 2012, 3:15:29 PM
On 2012-09-12, no.to...@gmail.com <no.to...@gmail.com> wrote:
> In article <k2nc60$p5m$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:
>> The thing I'm struggling with is trying to understand what the
>> purpose of having 2 files is given what you say you want to do.

< Snipped 46 lines of explanation from Chris >

And you still didn't answer Ed's question. You haven't even said
whether Ed's better solution works for you. He provided you an awk
script that should do what you want, and you start wondering if awk
can actually do what you want.

Ed, I suspect Chris thought he would need to specify the lines of text
he wanted to remove from his saved file of web pages so that awk would
know what lines to remove. (I would have thought so, too.)

*** Is this correct, Chris?

no.to...@gmail.com

Sep 14, 2012, 1:53:50 AM
In article <k2qs5f$l5m$1...@dont-email.me>, Chick Tower <c.t...@deadspam.com> wrote:

>
> Since you seem willing to create these extra files, maybe you can tweak
> the options you give the diff command to pare those unwanted lines out
> of the output of the difference between your source file and you file of
> unwanted lines. You would have to have one set of those unwanted lines
> for each page in your source file, but you can just copy-and-paste them
> as many times as you need. Perhaps there's even a modified version of
> diff that will use the same set of lines over and over.
>
In *nix everything is a file. Think in terms of tmp-files or pipes. The M$ like
accumulation of masses of files is exactly what I'm against.

> grep -v could do what you want, too, as suggested by others.
>
I've described the deficiencies of THAT method, which I previously used.

> Here's another thought. Since you're willing to perform a task for each
> of your source files, why not just edit the files and delete the lines
> you don't want to see? Or, just skip over those lines when you read the
> file, and not have any extra work to do?
>
I'm getting tired of doing it manually; like going down to the river to
drink every day, instead of having "programmed" water to the house
for "repeated" consumption.

Actually it's a simple editor's search-and-replace task. But the editor
would need a 50*80=4000 char buffer.

> I know you asked about awk, but I suspect you're more interested in a
> solution than the use of awk for that solution.
> --
> Chick Tower
What matters is the amount of human effort.
You've got to see the stuff which you don't want repeated.
Then you just mark it,
and say <delete any further repeats of this block>.
It's 3 actions.


no.to...@gmail.com

Sep 14, 2012, 5:06:31 AM
> In article <k2nc60$p5m$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:-
>
> I still won't be able to help you overall without a concrete example
>
I've just hit on your good concrete example, in my research on "how to
explain to law-people, who have a clerk-mentality, and don't intuitively
know that cause-can't-come-AFTER-effect".

But first: normally theoretical knowledge trumps empirical 'examples'.
When your teacher told you that "2 apples plus 2 apples equals 4 apples",
did you say "show me the apples"?

Using a script, which calls a script, which fetches and appends the contents
of the URLs [in a file], being:-
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3161424/
http://www.ncbi.nlm.nih.gov/pubmed/10815775
http://www.ncbi.nlm.nih.gov/pubmed/17083926
http://www.ncbi.nlm.nih.gov/pubmed/11934006
http://www.ncbi.nlm.nih.gov/pubmed/17188554
gave my concrete example. Sorry I can't send you 4 apples.

`lynx -dump http://www.ncbi.nlm.nih.gov/pubmed/17188554 >> AccumFile`
should eg. append the text-only web-page to AccumFile,
for your concrete example.

On examining the 5 fetches, I noticed that the
repeatedly-redundant blocks were quite big.

-articles/PMC316142 differs from the other 4
by the <"["{digit}"]">.
But the other 4 have identical repeatedly-redundant block of
236-lines - each.

For these 5 fetches, the actual info, seems to be less than
20 lines average. So with the 'tail-garbage' too, the fetch
is over 90% garbage. And that's without the pictures!

IIRC `diff` outputs a syntax like:
<separator token>
<line number>
<the actual line>
so a parser could:
IF the <the actual line> cites a <"["{digit}"]"> THEN IsSame.
Ie. the entry for lines which differ only by <"["{digit}"]"> could
be omitted/deleted.
And if there were no diff-entries, the whole block could be deleted.
The approach allows a data driven incremental
patch-as-you-learn method.
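
A diff-free variation on the same idea, as a hedged sketch: strip the [digits] link indices from both the delete-lines and the input lines before comparing, so lines that differ only in those indices count as the same. The function name norm and the sample lines are assumptions for illustration.

```shell
cd "$(mktemp -d)" || exit 1
printf '[49]Print | [50]Individual message\n' > DeleteFile
printf '%s\n' '[59]Print | [60]Individual message' \
              'real content, see [3]' \
              '[12]Print | [13]Individual message' > InputFile

out=$(awk '
    # Return s with every "[digits]" link index removed.
    function norm(s) { gsub(/\[[0-9]+\]/, "", s); return s }
    NR == FNR { del[norm($0)]; next }   # remember the normalised delete-lines
    !(norm($0) in del)                  # print lines whose normalised form is not listed
' DeleteFile InputFile)
echo "$out"
```

Surviving lines are printed unmodified, so a "[3]" inside real content is preserved.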


WDYS?


J G Miller

Sep 14, 2012, 5:14:42 AM
On Friday, September 14th, 2012, at 09:06:31h +0000, Chris Glur asked:

> But first: normally theoretical knowledge trumps empirical 'examples'.
> When your teacher told you that "2 apples plus 2 appel equals 4 apples",
> did you say "show me the apples"?

This is completely bogus. The child does not need to ask the teacher
to shew them the apples because the child is already familiar with
apples, which is why the teacher used them in the example.

(Substitute another fruit in non apple cultures.)

J G Miller

Sep 14, 2012, 5:22:46 AM
On Friday, September 14th, 2012, at 05:53:50h +0000, Chris Glur explained:

> You've got to see the stuff which you don't want repeated.

If the blocks of text are repeated identically from one web
page to the next, then you could just use uniq.

NAME

uniq - report or omit repeated lines

DESCRIPTION

Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

Otherwise you will have to use regular expressions marking
the start and end of the blocks with either sed or awk or perl.
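
One caveat, illustrated with made-up data: as the description above says, uniq only compares adjacent lines, so repeats separated by other text survive, whereas the awk idiom '!seen[$0]++' drops any line already seen anywhere earlier in the input.

```shell
# The header line "H" repeats, but never on adjacent lines.
input=$(printf 'H\npage 1\nH\npage 2\n')

with_uniq=$(printf '%s\n' "$input" | uniq)              # adjacent-only: nothing removed
with_awk=$(printf '%s\n' "$input" | awk '!seen[$0]++')  # any earlier line counts as a repeat
printf 'uniq:\n%s\nawk:\n%s\n' "$with_uniq" "$with_awk"
```

The awk idiom is line-based as well, so it would also drop legitimately repeated lines inside the article text.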

Ed Morton

Sep 14, 2012, 9:00:02 AM
On 9/14/2012 4:06 AM, no.to...@gmail.com wrote:
<snip>
> But first: normally theoretical knowledge trumps empirical 'examples'.
> When your teacher told you that "2 apples plus 2 appel equals 4 apples",
> did you say "show me the apples"?

No, but

a) you're not our teacher, you're a peer asking us for help, and
b) if, before I'd ever taken a geometry class, a kid on the street had come up
to me and said "the square on the hypotenuse is equal to the sum of the squares
of the 2 shorter sides, can you help me do my homework?" without showing me a
triangle or telling me what a hypotenuse is I might have raised an eyebrow.

You know your domain so you think what you are telling us about your problem is
adequate for us to help you solve it. We don't know your domain so we don't have
enough information to help you solve your problem.

A small, concrete example is almost certainly all we need to complete the puzzle.

By an example I mean up to 20 lines of SPECIFIC text for input and the
corresponding SPECIFIC text output you'd expect given that input. Not a bunch of
instructions on what an input file could look like or how we could create one by
going to a bunch of web sites and pulling out data, and then vague statements on
what kind of format the output might be.

Ed.

The Natural Philosopher

Sep 14, 2012, 9:45:48 AM
Ed Morton wrote:
> On 9/14/2012 4:06 AM, no.to...@gmail.com wrote:
> <snip>
>> But first: normally theoretical knowledge trumps empirical 'examples'.
>> When your teacher told you that "2 apples plus 2 appel equals 4 apples",
>> did you say "show me the apples"?
>
> No, but
>
> a) you're not our teacher, you're a peer asking us for help, and
> b) if, before I'd ever taken a geometry class, a kid on the street had
> come up to me and said "the square on the hypotenuse is equal to the sum
> of the squares of the 2 shorter sides, can you help me do my homework?"
> without showing me a triangle or telling me what a hypotenuse is I might
> have raised an eyebrow.
>
I'd have raised a toecap.



--
Ineptocracy

(in-ep-toc’-ra-cy) – a system of government where the least capable to
lead are elected by the least capable of producing, and where the
members of society least likely to sustain themselves or succeed, are
rewarded with goods and services paid for by the confiscated wealth of a
diminishing number of producers.

J G Miller

Sep 14, 2012, 10:55:49 AM
On Friday, September 14th, 2012, at 14:45:48h +0100,
The Natural Philosopher wrote:

> I'd have raised a toecap.

And then you would have been charged with assaulting a child,
unlike Norman Tebbit.

The Natural Philosopher

Sep 14, 2012, 12:18:23 PM
Are there any children like Norman Tebbit?

Loki Harfagr

Sep 14, 2012, 12:44:00 PM
Fri, 14 Sep 2012 17:18:23 +0100, The Natural Philosopher did cat :

> J G Miller wrote:
>> On Friday, September 14th, 2012, at 14:45:48h +0100,
>> The Natural Philosopher wrote:
>>
>>> I'd have raised a toecap.
>>
>> And then you would have been charged with assaulting a child,
>> unlike Norman Tebbit.
>
> Are there any children like Norman Tebbit?

in "here dragons" land?

Ed Morton

Sep 14, 2012, 1:44:57 PM
J G Miller <mil...@yoyo.ORG> wrote:

> On Friday, September 14th, 2012, at 09:06:31h +0000, Chris Glur asked:
>
> > But first: normally theoretical knowledge trumps empirical 'examples'.
> > When your teacher told you that "2 apples plus 2 appel equals 4 apples",
> > did you say "show me the apples"?
>
> This is completely bogus. The child does not need to ask the teacher
> to shew them the apples because the child is already familiar with
> apples, which is why the teacher used them in the example.

It's also not necessary to see an apple, it's just necessary to understand what
operation you're being asked to perform on them. In the current case the OP is
saying the equivalent of:

Operation X on J apples and K oranges equals L fruits. If I put the apples
and oranges in a duffle bag and weigh them and then subtract the weight of the
duffle bag and divide the result by the average weight of a piece of fruit and
round to the nearest integer value then I will get the value for L - how do I
code Operation X?

Wouldn't it be nice if he told us that when J is 2 and K is 3, L is 5 so we
could suggest the "+" operator instead?

Regards,

Ed.

>
> (Substitute another fruit in non apple cultures.)


Posted using www.webuse.net

no.to...@gmail.com

Sep 14, 2012, 4:35:32 PM
In article <k2v9om$l3s$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:

> On 9/14/2012 4:06 AM, no.to...@gmail.com wrote:
> <snip>
> You know your domain so you think what you are telling us about your problem is
> adequate for us to help you solve it. We don't know your domain so we don't have
> enough information to help you solve your problem.
>
> A small, concrete example is almost certainly all we need to complete the puzzle.
>
> By an example I mean up to 20 lines of SPECIFIC text for input and the
> corresponding SPECIFIC text output you'd expect given that input. Not a bunch of
> instructions on what an input file could look like or how we could create one by
> going to a bunch of web sites and pulling out data, and then vague statements on
> what kind of format the output might be.
>
> Ed.
OK, I'll follow YOUR example format.
==> Here's FileIn : 20 lines:-
url1
start
of 3 lines of garbage
last garbage line
info-line1:url1
info-line2:url1
<><><><>
url2
start
of 3 lines of garbage
last garbage line
info-line1:url2
<><><><>
url3
start
of 3 lines of garbage
last garbage line
info-line1:url3
info-line2:url3
<><><><>

==> Here's FileDelete : 3 lines:-
start
of 3 lines of garbage
last garbage line

==> Here's FileOut : 20 -2*3 =14 lines:-
url1
start
of 3 lines of garbage
last garbage line
info-line1:url1
info-line2:url1
<><><><>
url2
info-line1:url2
<><><><>
url3
info-line1:url3
info-line2:url3
<><><><>


== TIA

Dave Gibson

Sep 14, 2012, 7:06:33 PM
[ Followup-To: set to comp.lang.awk]
awk -f the_following_script FileDelete FileIn > result

----script begins on next line
#! /usr/bin/awk -f

function flush_buffer(discard, n) {
    if (!discard)
        for (n = 1; n <= bufpos; n++)
            print buffer[n]
    bufpos = 0
}

NR == FNR {
    delete_list[++delmax] = $0
    next
}

$0 ~ delete_list[bufpos + 1] {
    buffer[++bufpos] = $0
    if (bufpos >= delmax)
        flush_buffer(blocks_seen++)
    next
}

bufpos {
    flush_buffer(0)
}

{ print }

END {
    flush_buffer(0)
}
----script ends on previous line

Ed Morton

Sep 14, 2012, 8:03:23 PM
On 9/14/2012 3:35 PM, no.to...@gmail.com wrote:
> In article <k2v9om$l3s$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:
>
>> On 9/14/2012 4:06 AM, no.to...@gmail.com wrote:
>> <snip>
>> You know your domain so you think what you are telling us about your problem is
>> adequate for us to help you solve it. We don't know your domain so we don't have
>> enough information to help you solve your problem.
>>
>> A small, concrete example is almost certainly all we need to complete the puzzle.
>>
>> By an example I mean up to 20 lines of SPECIFIC text for input and the
>> corresponding SPECIFIC text output you'd expect given that input. Not a bunch of
>> instructions on what an input file could look like or how we could create one by
>> going to a bunch of web sites and pulling out data, and then vague statements on
>> what kind of format the output might be.
>>
>> Ed.
> OK, I'll follow YOUR example format.

Great. Thank you!
That'd just be this:

--------------------
$ cat FileDelete
start
of 3 lines of garbage
last garbage line
--------------------
$ cat FileIn
url1
start
of 3 lines of garbage
last garbage line
info-line1:url1
info-line2:url1
<><><><>
url2
start
of 3 lines of garbage
last garbage line
info-line1:url2
<><><><>
url3
start
of 3 lines of garbage
last garbage line
info-line1:url3
info-line2:url3
<><><><>
--------------------
$ cat tst.awk
BEGIN{ RS=ORS="\n<><><><>\n" }
NR==FNR { skip=$0; next }
found[skip]++ { sub(skip,"") }
{ print }
--------------------
$ gawk -f tst.awk FileDelete FileIn
url1
start
of 3 lines of garbage
last garbage line
info-line1:url1
info-line2:url1
<><><><>
url2
info-line1:url2
<><><><>
url3
info-line1:url3
info-line2:url3
<><><><>
--------------------

Note that I used GNU awk (gawk) as it allows the RS to be more than a single
character. There are various alternative solutions.

Does the above do what you want? I thought there was something about ignoring
white space and digits between square brackets so maybe you need to tweak your
input files a bit to be more representative of things like that?

Ed.
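
One of the "alternative solutions" mentioned above can be sketched without a
multi-character RS, so it should not depend on gawk. This is only a sketch
under stated assumptions: lines from FileDelete are compared as literal strings
rather than as REs, the block is kept the first time it is seen and dropped
afterwards, and the names strip_repeats, del, buf, pos and seen are made up for
illustration.

```shell
# strip_repeats DELETE_FILE DATA_FILE  (hypothetical helper name)
strip_repeats() {
  awk '
  function flush(   i) {              # give back buffered lines unchanged
      for (i = 1; i <= pos; i++) print buf[i]
      pos = 0
  }
  NR == FNR { del[++n] = $0; next }   # first file: the block to look for
  {
      if (pos && $0 != del[pos + 1])  # the run of matching lines is broken
          flush()
      if (n && $0 == del[pos + 1]) {  # this line starts or extends a run
          buf[++pos] = $0
          if (pos == n) {             # the whole block has been matched
              if (seen++)
                  pos = 0             # a repeat: drop it
              else
                  flush()             # first occurrence: keep it
          }
          next
      }
      print
  }
  END { flush() }                     # a run cut short by end of input
  ' "$1" "$2"
}

# Small demonstration with throwaway files.
dir=$(mktemp -d)
printf 'start\nlast garbage line\n' > "$dir/FileDelete"
printf 'url1\nstart\nlast garbage line\ninfo:url1\nurl2\nstart\nlast garbage line\ninfo:url2\n' > "$dir/FileIn"
strip_repeats "$dir/FileDelete" "$dir/FileIn"
```

In the demonstration the two-line block is printed under url1 and removed under
url2. The sketch does not backtrack, so a block whose opening lines repeat
inside the block itself might be missed.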

Steve Hayes

Sep 15, 2012, 1:25:53 AM
On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:

>We need these heuristics to trim out the repetative garbage
>that even the minimalist lynx/links fetches from http-pages.

I'd be interested in knowing whether awk could trim unwanted lines from the
headers of saved e-mail messages, to remove all but sender, recipient, date,
subject and key words.

Has anyone developed an awk script for that?


--
Steve Hayes from Tshwane, South Africa
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk

unruh

Sep 15, 2012, 11:27:02 AM
On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:
> On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:
>
>>We need these heuristics to trim out the repetative garbage
>>that even the minimalist lynx/links fetches from http-pages.
>
> I'd be interested in knowing whether awk could trim unwanted lines from the
> headers of saved e-mail messages, to remove all but sender, recipient, date,
> subject and key words.

Sure. I have no idea what the key for "keywords" is but for the rest

awk ' BEGIN {H=1}
H == 1 && ( $0 ~/^Subject:/ || $0 ~ /^To:/ || $0 ~ /^From:/) {print $0}
H==1 && $0 ~ /^$/ {H=0}
H == 0 {print $0}' nameoffile

should do it.

Ed Morton

Sep 15, 2012, 12:02:11 PM
I've no idea if it solves the OP's problem or not but that can be written more
succinctly as:

awk ' BEGIN {H=1}
H == 1 && /^(Subject|To|From):/
H == 1 && /^$/ {H=0}
H == 0' nameoffile

Seems unlikely it'd work, though, given H is never reset to 1, unless there's
only 1 email in the file. If that's really all there is in the input file then
you could just do:

awk '/^(Subject|To|From):/{print; next} {exit}' nameoffile

For the OP, take a look at the tools formail and procmail. If you still need awk
after that, post a small representative input file and expected output to
comp.lang.awk.

Ed.

unruh

Sep 15, 2012, 2:02:55 PM
On 2012-09-15, Ed Morton <morto...@gmail.com> wrote:
> On 9/15/2012 10:27 AM, unruh wrote:
>> On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:
>>> On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:
>>>
>>>> We need these heuristics to trim out the repetative garbage
>>>> that even the minimalist lynx/links fetches from http-pages.
>>>
>>> I'd be interested in knowing whether awk could trim unwanted lines from the
>>> headers of saved e-mail messages, to remove all but sender, recipient, date,
>>> subject and key words.
>>
>> Sure. I have no idea what the key for "keywords" is but for the rest
>>
>> awk ' BEGIN {H=1}
>> H == 1 && ( $0 ~/^Subject:/ || $0 ~ /^To:/ || $0 ~ /^From:/) {print $0}
>> H==1 && $0 ~ /^$/ {H=0}
>> H == 0 {print $0}' nameoffile
>>
>> should do it.
>
> I've no idea if it solves the OPs problem or not but that can be written more
> succinctly as:
>
> awk ' BEGIN {H=1}
> H == 1 && /^(Subject|To|From):/
> H == 1 && /^$/ {H=0}
> H == 0' nameoffile

But of course at the expense of making it a bit less readable.
The concatenation of "ors" I did not know, nor that there is an
automatic print $0 if the file name exits

Sometimes prolixity is worth the extra coding to make sure everyone
understands what is going on. Debugging is usually far far far more
expensive than coding or running, and anything that decreases debugging
time is worth it. Using neat tricks is almost never worth anything
although it is certainly worth knowing them.



>
> Seems unlikely it'd work, though, given H is never reset to 1, unless there's
> only 1 email in the file. If that's really all there is in the input file then

That is what I assumed.

> you could just do:
>
> awk '/^(Subject|To|From):/{print; next} {exit}' nameoffile

?? How would this be the same? It would seem that this would not do what
was needed: scan the headers until the first empty line, and after that
print everything.

Exactly how to determine that the next email starts I am not sure of,
or I would have used that to set H=1 again.
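
If the file is in the traditional mbox format, each message begins with a
separator line that starts with "From " (the word From followed by a space, not
the "From:" header), so that line could be used to set H back to 1. A minimal
sketch under that assumption; the function name trim_headers and the sample
mailbox are made up for illustration:

```shell
# trim_headers [MBOX_FILE]  (hypothetical helper; reads stdin if no file given)
trim_headers() {
  awk '
  BEGIN { H = 1 }                              # input starts inside headers
  /^From / { H = 1; print; next }              # mbox separator: headers follow
  H == 1 && /^(Subject|To|From):/ { print }    # wanted header lines
  H == 1 && /^$/ { H = 0 }                     # empty line ends the headers
  H == 0 { print }                             # body, including that empty line
  ' "$@"
}

trim_headers <<'EOF'
From alice@example.com Thu May 17 19:15:51 2012
Received: from localhost
To: bob@example.com
From: alice@example.com
Subject: first

body one

From bob@example.com Thu May 17 19:20:00 2012
X-Mailer: test
Subject: second

body two
EOF
```

The separator lines are passed through so the result is still split into
messages. Continuation lines of long headers are not handled here.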

Ed Morton

Sep 15, 2012, 4:15:02 PM
On 9/15/2012 1:02 PM, unruh wrote:
> On 2012-09-15, Ed Morton <morto...@gmail.com> wrote:
>> On 9/15/2012 10:27 AM, unruh wrote:
>>> On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:
>>>> On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:
>>>>
>>>>> We need these heuristics to trim out the repetative garbage
>>>>> that even the minimalist lynx/links fetches from http-pages.
>>>>
>>>> I'd be interested in knowing whether awk could trim unwanted lines from the
>>>> headers of saved e-mail messages, to remove all but sender, recipient, date,
>>>> subject and key words.
>>>
>>> Sure. I have no idea what the key for "keywords" is but for the rest
>>>
>>> awk ' BEGIN {H=1}
>>> H == 1 && ( $0 ~/^Subject:/ || $0 ~ /^To:/ || $0 ~ /^From:/) {print $0}
>>> H==1 && $0 ~ /^$/ {H=0}
>>> H == 0 {print $0}' nameoffile
>>>
>>> should do it.
>>
>> I've no idea if it solves the OPs problem or not but that can be written more
>> succinctly as:
>>
>> awk ' BEGIN {H=1}
>> H == 1 && /^(Subject|To|From):/
>> H == 1 && /^$/ {H=0}
>> H == 0' nameoffile
>
> But of course at the expense of making it a bit less readable.

I assume you mean it's less readable because it doesn't explicitly state the
default action of "print $0". If you're going to use awk the first thing you
need to understand is that an awk script is built of "condition { <action> }"
statements and that the default condition is "true" and the default action is
"print $0". Given you understand that, the above shouldn't be cryptic at all.

> The concatenation of "ors" I did not know, nor that there is an
> automatic print $0 if the file name exits

Ah, that would explain why you felt it was less readable then.

> Sometimes prolixity is worth the extra coding to make sure everyone
> understands what is going on. Debugging is usually far far far more
> expensive than coding or running, and anything that decreases debugging
> time is worth it. Using neat tricks is almost never worth anything
> although it is certainly worth knowing them.

If you write an awk script like this:

awk '{print $3}' file

would you consider it a trick that you relied on the default condition of "true"
to cause your action to execute? If not then why would you consider relying on
the default action of "print $0" to be a trick? I hope you're not writing:

awk '1==1 {print $3}' file

or something to avoid both defaults!

>> Seems unlikely it'd work, though, given H is never reset to 1, unless there's
>> only 1 email in the file. If that's really all there is in the input file then
>
> That is what I assumed.
>
>> you could just do:
>>
>> awk '/^(Subject|To|From):/{print; next} {exit}' nameoffile
>
> ?? How would this be the same? It would seem that this would not do what
> was needed: scan the headers until the first empty line, and after that
> print everything.

Ah you're right I misunderstood the requirements and what your script was doing.
I thought it stopped printing after the headers. What you had was fine then
though I'd really have written it as:

awk '
inBody || /^(Subject|To|From):/ { print; next }
/^$/ { inBody=1 }
' nameoffile

Regards,

Ed.

Chick Tower

Sep 15, 2012, 5:02:43 PM
On 2012-09-14, J G Miller <mil...@yoyo.ORG> wrote:
> If the blocks of text are repeated identically from one web
> page to the next, then you could just use uniq.

After looking at the man page, it appears uniq won't do the job, J G.
It has a note saying that the lines have to be adjacent, and suggests
sorting the input file before using uniq. That would leave the file a
jumbled mess.
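
For single lines, at least, awk can drop repeats without sorting first. A
minimal sketch (the names dedup_lines and seen are arbitrary): it keeps the
first occurrence of every line and removes later copies wherever they appear,
so the original order is preserved.

```shell
# dedup_lines [FILE...]  (hypothetical helper; reads stdin if no file given)
dedup_lines() {
  # seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
  # is true only then, and the default action prints the line.
  awk '!seen[$0]++' "$@"
}

printf 'logo\nnews A\nlogo\nnews B\nnews A\n' | dedup_lines
```

This is not the same operation as deleting a known multi-line block: every
repeated line goes, including blank lines and separators, so it may remove more
than is wanted.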

Chick Tower

Sep 15, 2012, 5:02:44 PM
On 2012-09-14, no.to...@gmail.com <no.to...@gmail.com> wrote:
> What matters is the amount of human effort.
> You've got to see the stuff which you don't want repeated.
> Then you just mark it,
> and say <delete any further repeats of this block>.
> It's 3 actions.

But, with Ed Morton's solution, and if J. G. Miller's suggestion of the
uniq command works, you don't have to see what you want deleted. You
could just run a command (which might be a script, for even easier use)
and out comes the file the way you want it, without text that is
repeated on each page. For that matter, it appears they would remove
text that is identical on any two pages.

Kaz Kylheku

Sep 15, 2012, 6:13:27 PM
On 2012-09-12, no.to...@gmail.com <no.to...@gmail.com> wrote:
> They put their logos and adverts at the top, before the real
> info that you may want to capture and keep.
> And they might spread the info over 10 pages: again like paper
> magazines: using multiple issues.

Can you post the actual URL to the web page that you're trying to scrape, and
an example of what you want the scraped output to be?

I made a nice language which is ideal for scraping web pages:

Here is a web scraping question on Stack Overflow where the TXR solution
was accepted as the answer.

http://stackoverflow.com/questions/10055385/extract-text-from-html-table

Ed Morton

Sep 15, 2012, 10:13:34 PM
Re-reading the thread just now in light of the example you gave, I think I have
a better idea of what you want to do. Below is an alternative solution that
constructs an RE from each record in FileDelete and then removes every string
matching each of those constructed REs from every record in FileIn after the
first time it's seen. It accounts for sequences of digits within square brackets
and sequences of blanks (what I'm referring to as variable strings below for
want of a better term) being unimportant in the comparison.

Ed.

$ cat FileDelete
[12] start
of 3 [345] lines of garbage
last [47] garbage [98765] line
-------------------------
$ cat FileIn
url1
[984] start
of 3 [1] lines of garbage
last [0909] garbage [12] line
info-line1:url1
info-line2:url1
<><><><>
url2
[123456] start
of 3 [0] lines of garbage
last [1] garbage [1] line
info-line1:url2
<><><><>
url3
[3] start
of 3 [28282] lines of garbage
last [999] garbage [888] line
info-line1:url3
info-line2:url3
<><><><>
-------------------------
$ cat tst.awk
BEGIN{ RS=ORS="\n<><><><>\n" }

NR == FNR {
    # Escape all RE metacharacters
    gsub( /[][\^$.*+?()|{}]/ , "\\\\&" )

    # Get rid of all specified blanks at the start and end
    # of each line so they won't be required in the matches.
    gsub( /(^[[:blank:]]+|[[:blank:]]+$)/ , "" )
    gsub( /[[:blank:]]*\n[[:blank:]]*/ , "\n" )

    # Convert variable strings to the REs that would match them.
    gsub( /\\[[][[:digit:]]+\\[]]/ , "\\[[[:digit:]]+\\]" )
    gsub( /[[:blank:]]+/ , "[[:blank:]]+" )

    # Allow optional blanks at the start and end of each line.
    # Get rid of the final \n temporarily so the second gsub()
    # below doesn't add blanks after it, then add it back after.
    gsub( /(^|\n$)/ , "[[:blank:]]*" )
    gsub( /\n/ , "[[:blank:]]*\n[[:blank:]]*" )
    $0 = $0 "\n"

    skip[$0]

    next
}

{
    for (s in skip) {
        if (s in seen) {
            gsub(s,"")
        }
        seen[s]
    }

    print
}
-------------------------
$ awk -f tst.awk FileDelete FileIn
url1
[984] start
of 3 [1] lines of garbage
last [0909] garbage [12] line
info-line1:url1
info-line2:url1
<><><><>
url2
info-line1:url2
<><><><>
url3
info-line1:url3
info-line2:url3
<><><><>
-------------------------

Steve Hayes

Sep 16, 2012, 9:19:23 AM
On Sat, 15 Sep 2012 11:02:11 -0500, Ed Morton <morto...@gmail.com> wrote:

>On 9/15/2012 10:27 AM, unruh wrote:
>> On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:
>>> On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:
>>>
>>>> We need these heuristics to trim out the repetative garbage
>>>> that even the minimalist lynx/links fetches from http-pages.
>>>
>>> I'd be interested in knowing whether awk could trim unwanted lines from the
>>> headers of saved e-mail messages, to remove all but sender, recipient, date,
>>> subject and key words.
>>
>> Sure. I have no idea what the key for "keywords" is but for the rest
>>
>> awk ' BEGIN {H=1}
>> H == 1 && ( $0 ~/^Subject:/ || $0 ~ /^To:/ || $0 ~ /^From:/) {print $0}
>> H==1 && $0 ~ /^$/ {H=0}
>> H == 0 {print $0}' nameoffile
>>
>> should do it.
>
>I've no idea if it solves the OPs problem or not but that can be written more
>succinctly as:

Ok, let me clarify with examples.

Here is a saved e-mail message:


===Begin ===
Return-Path:
sentto-16630095-180-1337274946-hayesstw=telkom...@returns.groups.yahoo.com
Received: from bianca.lb1.telkomsa.net (LHLO bianca.telkomsa.net)
(192.168.222.62) by mail6.telkomsa.net with LMTP; Thu, 17 May 2012 19:15:51
+0200 (SAST)
Received: from localhost (localhost [127.0.0.1])
by bianca.telkomsa.net (Postfix) with ESMTP id 6C23F301FD
for <online...@telkomsa.net>; Thu, 17 May 2012 19:15:51 +0200
(SAST)
X-Virus-Scanned: amavisd-new at telkomsa.net
Authentication-Results: bianca.telkomsa.net (amavisd-new); dkim=pass
header.i=@yahoogroups.com
Authentication-Results: bianca.telkomsa.net (amavisd-new); domainkeys=pass
header.sender=moderato...@yahoogroups.com
Authentication-Results: bianca.telkomsa.net (amavisd-new); dkim=softfail
(fail, message has been altered) header.i=@yahoogroups.com
Received: from bianca.telkomsa.net ([127.0.0.1])
by localhost (bianca.telkomsa.net [127.0.0.1]) (amavisd-new, port
10024)
with ESMTP id mgEH9BHEwdck for <online...@telkomsa.net>;
Thu, 17 May 2012 19:15:51 +0200 (SAST)
Received: from as3.telkomsa.net (unknown [196.25.211.210])
by bianca.telkomsa.net (Postfix) with ESMTP id 49C3C301F5
for <haye...@telkomsa.net>; Thu, 17 May 2012 19:15:51 +0200 (SAST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result:
AoACAG+DuE5iite3kWdsb2JhbACBX451MQ+NWwEBAQEJCQ0HEiijLgEFgT+CEQEEhhmURB0JiiA
Received: from ng9-ip4.bullet.mail.ne1.yahoo.com ([98.138.215.183])
by as3.telkomsa.net with SMTP; 17 May 2012 19:08:18 +0200
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoogroups.com;
s=lima; t=1337274947; bh=9s8eH6o6RY7NkSozNO/96FOchlVaW25YTLOrVsAGcUk=;
h=Received:Received:X-Yahoo-Newman-Id:Received:Received:Received:DKIM-Signature:Received:Received:X-Sender:X-Apparently-To:X-Received:X-Received:X-Received:X-Received:X-Received:To:Message-ID:User-Agent:X-Mailer:X-eGroups-Announce:X-Originating-IP:X-eGroups-Msg-Info:X-Yahoo-Post-IP:From:X-Yahoo-Profile:X-eGroups-Approved-By:Sender:MIME-Version:Mailing-List:Delivered-To:List-Id:Precedence:List-Unsubscribe:Date:Subject:Reply-To:X-Yahoo-Newman-Property:Content-Type:Content-Transfer-Encoding;
b=fqdU/CGSEO1SdOEY7bMnMVcFB8ilvF3fu3q0U7SQ8bpGBxUIJtRODrqUQrYd8zNe5fx6LeMHdH7HwoAjOUmQqEod6x0z1ze/bvHUvNZ4a+A300Qr7RWA7cmUh5GXKPdN
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=lima; d=yahoogroups.com;

b=XPbfbbyIyE0zgZ7W1CeKdDtETxvXiNTnwKurD4Or3kY51v4shjNfmgaU4KHLDkHH4FPr2aloCuq8aaj1U0a2dXI6kFxw1qzQegYc971RFVRusLZKVFjy6+HEsy/Y7WJ0;
Received: from [98.138.217.182] by ng9.bullet.mail.ne1.yahoo.com with NNFMP;
17 May 2012 17:15:47 -0000
Received: from [98.137.34.39] by tg7.bullet.mail.ne1.yahoo.com with NNFMP; 17
May 2012 17:15:47 -0000
X-Yahoo-Newman-Id: 16630095-m180
Received: (qmail 74076 invoked from network); 17 May 2012 17:15:44 -0000
Received: from unknown (98.137.35.162)
by m3.grp.sp2.yahoo.com with QMQP; 17 May 2012 17:15:44 -0000
Received: from unknown (HELO ng8-vm5.bullet.mail.gq1.yahoo.com)
(98.136.219.96)
by mta6.grp.sp2.yahoo.com with SMTP; 17 May 2012 17:15:44 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoogroups.com;
s=lima; t=1337274943; bh=E25f9Jn7R3mAH/bUmqPGbJIuPScm2XjgEXLEl4PESHE=;
h=Received:Received:X-Sender:X-Apparently-To:X-Received:X-Received:X-Received:X-Received:X-Received:Date:To:Message-ID:User-Agent:MIME-Version:X-Mailer:X-Yahoo-Newman-Property:X-eGroups-Announce:X-Originating-IP:X-eGroups-Msg-Info:X-Yahoo-Post-IP:From:Subject:X-Yahoo-Group-Post:X-Yahoo-Profile:Content-Type:Content-Transfer-Encoding:X-YGroups-SubInfo:Sender:X-eGroups-Approved-By:X-eGroups-Auth;
b=qip3AqCMwHcEkX2NWCTUoyEG9t18bWrCi49UnhIhFDfCPC3HzOpxAAEWfSUcfX0cNUTuYU+coltMUqy1QkTywlXKGGsrk6bBxgSz/v0MkP3rweOGwjthSrGb09ZDs71c
Received: from [98.137.0.86] by ng8.bullet.mail.gq1.yahoo.com with NNFMP; 17
May 2012 17:15:43 -0000
Received: from [98.137.34.36] by tg6.bullet.mail.gq1.yahoo.com with NNFMP; 17
May 2012 17:15:43 -0000
X-Sender: apar...@yahoo-inc.com
X-Apparently-To: moderato...@yahoogroups.com
X-Received: (qmail 3856 invoked from network); 17 May 2012 17:14:27 -0000
X-Received: from unknown (98.137.35.161)
by m7.grp.sp2.yahoo.com with QMQP; 17 May 2012 17:14:27 -0000
X-Received: from unknown (HELO ng12-vm5.bullet.mail.gq1.yahoo.com)
(98.136.219.148)
by mta5.grp.sp2.yahoo.com with SMTP; 17 May 2012 17:14:27 -0000
X-Received: from [98.137.0.82] by ng12.bullet.mail.gq1.yahoo.com with NNFMP;
17 May 2012 17:14:25 -0000
X-Received: from [98.137.34.72] by tg2.bullet.mail.gq1.yahoo.com with NNFMP;
17 May 2012 17:14:25 -0000
To: moderato...@yahoogroups.com
Message-ID: <jp3bl...@eGroups.com>
User-Agent: eGroups-EW/0.82
X-Mailer: Yahoo Groups Message Poster
X-eGroups-Announce: yes
X-Originating-IP: 122.172.12.227
X-eGroups-Msg-Info: 2:3:4:0:0
X-Yahoo-Post-IP: 122.172.12.227
From: "Ashish Parnami 's" <apar...@yahoo-inc.com>
X-Yahoo-Profile: ashp_jai
X-eGroups-Approved-By: ashp_jai <apar...@yahoo-inc.com> via web; 17 May 2012
17:15:43 -0000
Sender: moderato...@yahoogroups.com
MIME-Version: 1.0
Mailing-List: list moderato...@yahoogroups.com; contact
moderatorce...@yahoogroups.com
Delivered-To: mailing list moderato...@yahoogroups.com
List-Id: <moderatorcentral.yahoogroups.com>
Precedence: bulk
List-Unsubscribe: <mailto:moderatorcentr...@yahoogroups.com>
Date: Thu, 17 May 2012 17:14:25 -0000
Subject: [moderatorcentral] Announcement: Upcoming upgrade of Yahoo! Groups
Calendar
Reply-To: moderatorce...@yahoogroups.com
X-Yahoo-Newman-Property: groups-email-tradt-m
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Antivirus: avast! (VPS 120517-0, 2012-05-17), Inbound message
X-Antivirus-Status: Clean
X-PMFLAGS: 34095232 0 1 PFTVJDNC.CNM
X-CC-Diagnostic: Body contains "to learn more about" (30)

Dear Y! Groups Users,

I am really happy to announce that we have launched a new version of
calendar for Y! Groups!! I hope you'll enjoy the change and am eager to
listen to what you have to say.

Here are a few enhancements:

* New Improved UI: New Y! Groups Calendar comes with much improved
performance and cleaner look and feel

* Y! Groups Calendar within your personal Calendar: If you are a
Yahoo! Calendar user, you will like that you can view and edit groups
calendar events directly from within your personal calendar, subject to
appropriate permissions.

* Sync Groups Calendar on Mobile: Similar to your personal calendar,
your group calendars also can now be synced on mobile
<http://help.yahoo.com/kb/index?page=content&id=SLN5906&actp=search&view\
locale=en_US&searchid=1337245198152&locale=en_US&y=PROD_GRPS> devices
supporting CalDAV like iPhone, iPad etc.
US groups created after 16th May 2012 will get the new version of
calendar. Support for other regions will be available soon.

In next few weeks, we will be migrating calendars for all existing
groups to the newer version. Also, to get your early feedback, we'll
migrate moderatorcentral group sooner in the cycle.

We request you to try out the new Y! Groups Calendar and share your
feedback. To learn more about this update, please visit Y! Groups help
page here (help.yahoo.com/kb/index?page=content&id=SLN5906).

Regards,
Ashish Parnami
Product Manager for Y! Groups


[Non-text portions of this message have been removed]



------------------------------------

Visit the Groups Blog at: http://www.ygroupsblog.com/blog/Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/moderatorcentral/

<*> Your email settings:
Individual Email | Traditional

<*> To change settings online go to:
http://groups.yahoo.com/group/moderatorcentral/join
(Yahoo! ID required)

<*> To change settings via email:
moderatorce...@yahoogroups.com
moderatorcentr...@yahoogroups.com

<*> To unsubscribe from this group, send an email to:
moderatorcentr...@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/


-- End --

=== End quoted message ===

And I want to know if awk can convert it to this:

=== Begin quote ===

From: "Ashish Parnami 's" <apar...@yahoo-inc.com>
Sender: moderato...@yahoogroups.com
Date: Thu, 17 May 2012 17:14:25 -0000
Subject: [moderatorcentral] Announcement: Upcoming upgrade of Yahoo! Groups
Calendar

Dear Y! Groups Users,

I am really happy to announce that we have launched a new version of
calendar for Y! Groups!! I hope you'll enjoy the change and am eager to
listen to what you have to say.

Here are a few enhancements:

* New Improved UI: New Y! Groups Calendar comes with much improved
performance and cleaner look and feel

* Y! Groups Calendar within your personal Calendar: If you are a
Yahoo! Calendar user, you will like that you can view and edit groups
calendar events directly from within your personal calendar, subject to
appropriate permissions.

* Sync Groups Calendar on Mobile: Similar to your personal calendar,
your group calendars also can now be synced on mobile
<http://help.yahoo.com/kb/index?page=content&id=SLN5906&actp=search&view\
locale=en_US&searchid=1337245198152&locale=en_US&y=PROD_GRPS> devices
supporting CalDAV like iPhone, iPad etc.
US groups created after 16th May 2012 will get the new version of
calendar. Support for other regions will be available soon.

In next few weeks, we will be migrating calendars for all existing
groups to the newer version. Also, to get your early feedback, we'll
migrate moderatorcentral group sooner in the cycle.

We request you to try out the new Y! Groups Calendar and share your
feedback. To learn more about this update, please visit Y! Groups help
page here (help.yahoo.com/kb/index?page=content&id=SLN5906).

Regards,
Ashish Parnami
Product Manager for Y! Groups


[Non-text portions of this message have been removed]



------------------------------------

Visit the Groups Blog at: http://www.ygroupsblog.com/blog/Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/moderatorcentral/

<*> Your email settings:
Individual Email | Traditional

<*> To change settings online go to:
http://groups.yahoo.com/group/moderatorcentral/join
(Yahoo! ID required)

<*> To change settings via email:
moderatorce...@yahoogroups.com
moderatorcentr...@yahoogroups.com

<*> To unsubscribe from this group, send an email to:
moderatorcentr...@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/


-- End --

=== End quote ===

Manuel Collado

Sep 16, 2012, 11:44:49 AM
On 16/09/2012 15:19, Steve Hayes wrote:
> On Sat, 15 Sep 2012 11:02:11 -0500, Ed Morton <morto...@gmail.com> wrote:
>
>> On 9/15/2012 10:27 AM, unruh wrote:
>>> On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:
>>>> On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:
>>>>
>>>>> We need these heuristics to trim out the repetative garbage
>>>>> that even the minimalist lynx/links fetches from http-pages.
> ......
> Ok, let me clarify with examples.
>
> Here is a saved e-mail message:
>
> ===Begin ===
> Return-Path:
> sentto-16630095-180-1337274946-hayesstw=telkom...@returns.groups.yahoo.com
> Received: from bianca.lb1.telkomsa.net (LHLO bianca.telkomsa.net)
> (192.168.222.62) by mail6.telkomsa.net with LMTP; Thu, 17 May 2012 19:15:51
> +0200 (SAST)
> Received: from localhost (localhost [127.0.0.1])
> by bianca.telkomsa.net (Postfix) with ESMTP id 6C23F301FD
> for <online...@telkomsa.net>; Thu, 17 May 2012 19:15:51 +0200
> (SAST)
> X-Virus-Scanned: amavisd-new at telkomsa.net
> Authentication-Results: bianca.telkomsa.net (amavisd-new); dkim=pass
> header.i=@yahoogroups.com
> ......
> -- End --
>
> === End quoted message ===
>
> And I want to know if awk can convert it to this:
>
> === Begin quote ===
>
> From: "Ashish Parnami 's" <apar...@yahoo-inc.com>
> Sender: moderato...@yahoogroups.com
> Date: Thu, 17 May 2012 17:14:25 -0000
> Subject: [moderatorcentral] Announcement: Upcoming upgrade of Yahoo! Groups
> Calendar
> ......
> -- End --
>
> === End quote ===

What you want is exactly what I get when "saving as text" the message in my
Thunderbird e-mail reader. Don't know if other e-mail readers also
distinguish among "saving as text" and "saving as e-mail".

Hope this helps.

PS: of course, awk can convert the format as you want, but if your e-mail
reader can directly do the above, using awk will be overkill.

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado



unruh

Sep 16, 2012, 2:05:50 PM

And have you tried the suggestions?


On 2012-09-16, Steve Hayes <haye...@telkomsa.net> wrote:
> On Sat, 15 Sep 2012 11:02:11 -0500, Ed Morton <morto...@gmail.com> wrote:
>
>>On 9/15/2012 10:27 AM, unruh wrote:
>>> On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:
>>>> On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:
>>>>
>>>>> We need these heuristics to trim out the repetative garbage
>>>>> that even the minimalist lynx/links fetches from http-pages.
>>>>
>>>> I'd be interested in knowing whether awk could trim unwanted lines from the
>>>> headers of saved e-mail messages, to remove all but sender, recipient, date,
>>>> subject and key words.
>>>
>>> Sure. I have no idea what the key for "keywords" is but for the rest
>>>
>>> awk ' BEGIN {H=1}
>>> H == 1 && ( $0 ~/^Subject:/ || $0 ~ /^To:/ || $0 ~ /^From:/) {print $0}
>>> H==1 && $0 ~ /^$/ {H=0}
>>> H == 0 {print $0}' nameoffile
>>>
>>> should do it.
>>
>>I've no idea if it solves the OPs problem or not but that can be written more
>>succinctly as:
>
> Ok, let me clarify with examples.
>
> Here is a saved e-mail message:
<Rest deleted>

Dave Gibson

Sep 17, 2012, 6:05:18 AM
[ Followup-To: set to comp.lang.awk]

In comp.lang.awk, Steve Hayes <haye...@telkomsa.net> wrote:

>>> On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:

>>>> I'd be interested in knowing whether awk could trim unwanted lines
>>>> from the headers of saved e-mail messages, to remove all but
>>>> senter, recipient, date, subject and key words.

> Ok, let me clarify with examples.
>
> Here is a saved e-mail message:

[snip]

> And I want to know if awk can convert it to this:
>
> === Begin quote ===
>
> From: "Ashish Parnami 's" <......................>
> Sender: moderatorcentral@...............
> Date: Thu, 17 May 2012 17:14:25 -0000
> Subject: [moderatorcentral] Announcement: Upcoming upgrade of Yahoo! Groups

[snip message body]

----script begins on next line
#! /usr/bin/awk -f

BEGIN {

# Should "to" be in this list?
save_headers = "sender|from|date|subject|keywords"

hpat = "^(" save_headers "):"
}

# First empty line (RFC822 mandates cr/lf eols) marks end of headers
# Set flag to print everything from that point on.
/^\r?$/ { body = 1 }

body { print ; next }

# hdr is set if an interesting header has been seen. This line may be
# a continuation of that header.
hdr {
if ($0 ~ /^[[:blank:]]/) {
print
next
}
hdr = 0
}

# Header fieldnames are case insensitive
tolower($0) ~ hpat {
print
hdr = 1
}
----script ends on previous line

Ed Morton

Sep 17, 2012, 10:13:49 AM
Sorry, I was so focused on getting the RE construction from FileDelete right that
I didn't give much thought to how I'd perform the replacement in FileIn. Looking
at my previous posting now, that part of the script is nonsense, and below is how
it should be.

Chris Glur - please let us know whether or not the script below does what you
wanted. If not, let us know what it does wrong.

Ed.

$ echo "-----------------------------"
-----------------------------
$ cat FileDelete
[12] start
of 3 [345] lines of garbage
last [47] garbage [98765] line
$ echo "-----------------------------"
-----------------------------
$ cat FileIn
url1
[984] start
of 3 [1] lines of garbage
last [0909] garbage [12] line
info-line1:url1
info-line2:url1
<><><><>
url2
[123456] start
of 3 [0] lines of garbage
last [1] garbage [1] line
info-line1:url2
<><><><>
url3
[3] start
of 3 [28282] lines of garbage
last [999] garbage [888] line
info-line1:url3
info-line2:url3
<><><><>
$ echo "-----------------------------"
-----------------------------
$ cat tst.awk
BEGIN{ RS=ORS="\n<><><><>\n" }

NR == FNR {
# Escape all RE metacharacters
gsub( /[][\^$.*+?()|{}]/ , "\\\\&" )

# Get rid of all specified blanks at the start and end
# of each line so they won't be required in the matches.
gsub( /(^[[:blank:]]+|[[:blank:]]+$)/ , "" )
gsub( /[[:blank:]]*\n[[:blank:]]*/ , "\n" )

# Convert variable strings to the REs that would match them.
gsub( /\\[[][[:digit:]]+\\[]]/ , "\\[[[:digit:]]+\\]" )
gsub( /[[:blank:]]+/ , "[[:blank:]]+" )

# Allow optional blanks at the start and end of each line.
# Get rid of the final \n temporarily so the second gsub()
# below doesn't add blanks after it, then add it back after.
gsub( /(^|\n$)/ , "[[:blank:]]*" )
gsub( /\n/ , "[[:blank:]]*\n[[:blank:]]*" )
$0 = $0 "\n"

skip[$0]

next
}

{
for (s in skip) {
# Only remove occurrences of s after the first occurrence.
# "target" contains a copy of the sub-string of the current
# record in which to replace occurrences of s.

preTarget = ""
target = $0

if (!(s in seen) && match($0,"(^|\n)"s) ) {

# s has never occurred in any previous record and there
# is at least one occurrence of s in the current record.
# Prepare to skip the first occurrence and get rid of
# any remaining occurrences.
preTarget = substr($0,1,RSTART+RLENGTH-1)
target = substr($0,RSTART+RLENGTH)

seen[s]
}

# Remove s only if it occurs at the start of
# the record or immediately following a newline.
# Do not remove s if it occurs mid-line.
sub( "^"s , "" , target )
gsub( "\n"s , "\n" , target )

$0 = preTarget target

}

print
}
$ echo "-----------------------------"
-----------------------------
$ awk -f tst.awk FileDelete FileIn
url1
[984] start
of 3 [1] lines of garbage
last [0909] garbage [12] line
info-line1:url1
info-line2:url1
<><><><>
url2
info-line1:url2
<><><><>
url3
info-line1:url3
info-line2:url3
<><><><>
$ echo "-----------------------------"
-----------------------------

Posted using www.webuse.net

no.to...@gmail.com

Sep 18, 2012, 1:26:27 AM
In article <k33ckf$srb$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:

> On 9/14/2012 7:03 PM, Ed Morton wrote:
> > On 9/14/2012 3:35 PM, no.to...@gmail.com wrote:
> >> In article <k2v9om$l3s$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:
> >>
> >>> <snip>

> --------------------
> $ cat tst.awk
> BEGIN{ RS=ORS="\n<><><><>\n" }
> NR==FNR { skip=$0; next } # FileDelete: singleRecord -> skip
> found[skip]++ { sub(skip,"") }
> { print }
> --------------------
> $ gawk -f tst.awk FileDelete FileIn
> Note that I used GNU awk (gawk)
> as it allows the RS to be more than a single
> character. There are various alternative solutions.
> ---------------
=> copy FileIn to L
=> copy 2 <deleteBlocks> from L to d1, d1a

-> cat L | wc -l == 3112
-> cat d1 | wc -l == 26
-> cat d1a | wc -l == 26
--> ls -l d1* ==
-rw-r--r-- 1 root root 336 2012-09-17 07:19 d1
-rw-r--r-- 1 root root 832 2012-09-17 07:20 d1a

-> gawk -f tst.awk d1 L | wc -l == 3114
=?=> what are the 2 extra lines ?

-> gawk -f tst.awk d1 L > La

-> ls -l L* ==
-rw-r--r-- 1 root root 145590 2012-09-17 06:48 L
-rw-r--r-- 1 root root 145600 2012-09-17 07:31 La

==> so it's got 10 bytes extra.
==> L ends with:------
110. http://www.iep.utm.edu/?p=601
<><><><><>
-----------------
And -/tmp/La ends with:-----
110. http://www.iep.utm.edu/?p=601
<><><><><>

<><><><>
-------------
=> so there's an extra "<><><><>" introduced.

A major problem of under specifying the requirement
is the inflexible description of <put a line under it>,
which I specified as "<><><><>"
but in my test: FileIn is " <><><><><>"
and could vary in size, from time-to-time, because
<put a line under it> is a concept, valid for many sizes of line.

=> so patch the script [that's why it's nice to have the args
separated from the code, in their own file], to use:
"\n<><><><><>\n"

=> with modified tst.awk to atst.awk re-test:
-> gawk -f atst.awk d1 L | wc -l == 3112 ?! That's disappointing ?
and I can't easily debug it cos I don't understand the code-details.
The output ends ok:---
109. http://www.iep.utm.edu/sitemap/
110. http://www.iep.utm.edu/?p=601
<><><><><>

-> gawk -f atst.awk d1 L > La
-> ls -l L* ==
-rw-r--r-- 1 root root 145590 2012-09-17 06:48 L
-rw-r--r-- 1 root root 145590 2012-09-17 08:00 La

So the problem of writing an extra terminating line:
"\n<><><><>\n"
has been removed, but no blocks have been deleted.

==> It's inappropriate to post the full files, but if you can't
immediately see the problem-cause; either an elaboration
on my partial code-comments,
or appropriate tests may help to fix it.

BTW I've got decades of files where I've
used "<>...<>" of varying length
to <draw a line under it>.
So if "<>...<>" could be a Regex that would be good.

==> After further testing with DG's script, I've guessed that the
problem is caused by the 2 special-chars: "[", "]"
--------------------- <- a different type of <under-line>
=> Let's try a 1-line-block, in 'FileDelete' :
-> gawk -f atst.awk OneLine L | wc -l == 3110 Good.

Testing with a deleteFile which contains no ][ problem-chars:-
-> gawk -f atst.awk a8Lines L | wc -l 3060
=> 3112 - 3060 == 52 == 2*26
==> 2 26-line-blocks were deleted
=> ? how many deleteBlocks did the inFile contain?
=> select ONE line in the 26-line-block-PATTERN and grep-count:
-> cat L | grep "15. http://www.iep.utm.edu/h/" | wc -l == 3
=> So the DeleteFile appears 3-times in the InFile, and should be
deleted twice. Ie. 2*26 == 52 lines should be deleted,
as appears. = good.

-> gawk -f atst.awk 8Lines L | wc -l == 3112
============

It's just a coincidence that squareBrackets enclosing digits
should be ignored, for the matching, and they ALSO give a
problem; but I suspect that OTHER problematic specialChars exist.

This little problem shows how difficult it is when there's a
big gap between the designer and the tester.

If the code is understood [via comments] by the tester, it
can be adjusted accordingly.

Thanks,

== Chris Glur.






no.to...@gmail.com

Sep 18, 2012, 3:52:52 AM
In article <201209171...@webuse.net>, "Ed Morton" <morto...@gmail.com> wrote:

> please let us know whether or not the script below does
> what you wanted. If not, let us know what it does wrong.

To remove the problem of my varying length "<>..<>",
I could 'pre-normalise' them by:
sed s/OneOrMore: "<>" => StandardLen: "<>"
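
A minimal sketch of that pre-normalising step, done with awk rather than sed
so the whole job stays in one tool. It assumes a rule line holds nothing but
repeated "<>" plus optional blanks; the function name normalise_rules and the
four-pair standard length are only illustrative choices.

```shell
#!/bin/sh
# Rewrite every line consisting solely of one or more "<>" pairs
# (optionally surrounded by blanks) to one standard-length rule.
normalise_rules() {
    awk '
        /^[ \t]*(<>)+[ \t]*$/ { print "<><><><>"; next }   # any-length rule -> standard rule
        { print }                                          # all other lines pass through
    '
}

# Small demonstration with a 5-pair rule (leading blank) and a 1-pair rule.
normalised=$(printf 'url1\n <><><><><>\ninfo\n<>\n' | normalise_rules)
printf '%s\n' "$normalised"
```

After this pass a fixed RS such as "\n<><><><>\n" would see every rule,
whatever length it had in the original file.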

My posts hopefully show a sequence of:
{
=> :what ideas I get
-> :what code I use to try to test the ideas
== :what the results of the code are
}

There's more return-on-effort by improving the
human-interface than the bare-code.

So if you 'formulated' your algorithm, and punctuated
it by tests to confirm the critical steps, that would be
a good method, rather than me guessing what intermediate
tests to do.

In order for the med-doc to sensibly test/diagnose, he
first needs a mental-model of the test-item/patient.
So, eg. many tests applicable to a female don't apply to
a male patient, and the mental-model tells that.

But sure, I'll start testing after I've had a rest.

Thanks.


Janis Papanagnou

Sep 18, 2012, 4:21:22 AM
On 18.09.2012 09:52, no.to...@gmail.com wrote:
> In article <201209171...@webuse.net>, "Ed Morton" <morto...@gmail.com> wrote:
>
>> please let us know whether or not the script below does
>> what you wanted. If not, let us know what it does wrong.
>
> To remove the problem of my varying length "<>..<>",
> I could 'pre-normalise' them by:
> sed s/OneOrMore: "<>" => StandardLen: "<>"

Or, since Ed's solution already uses a non-standard gawk RS, just
redefine the RS using a regexp pattern like \n(<>)+\n

Janis

Steve Hayes

Sep 18, 2012, 7:27:38 AM
On Sat, 15 Sep 2012 11:02:11 -0500, Ed Morton <morto...@gmail.com> wrote:

I tried the scripts, but each one reported a syntax error.

But then I'm a complete AWK novice, and was really only asking if it was
possible - in other words, if AWK was suitable for this.

I'll have a look at comp.lang.awk, but it'll probably be way over my head.

Nevertheless, the idea of AWK appeals to me.

Janis Papanagnou

Sep 18, 2012, 8:07:22 AM
If you'd post the error (and the corresponding code) we might help you
further.

>
> But then I'm a complete AWK novice, and was really only asking if it was
> possible - in other words, if AWK was suitable for this.

Certainly it is.

Janis

Ed Morton

Sep 18, 2012, 9:15:51 AM
Then you're probably using old, broken awk.

> But then I'm a complete AWK novice, and was really only asking if it was
> possible - in other words, if AWK was suitable for this.

Yes, awk is suitable for ANY text processing application.
>
> I'll have a look at comp.lang.awk, but it'll probably be way over my head.

I doubt it. There's a couple of concepts to grasp then it's all pretty simple stuff.

Ed.

Ed Morton

Sep 18, 2012, 9:18:01 AM
On 9/18/2012 7:22 AM, Steve Hayes wrote:
> On Mon, 17 Sep 2012 16:27:21 GMT, "Ed Morton" <morto...@gmail.com> wrote:
>
>> Try this (untested):
>>
>> awk '
>> BEGIN{ keep="From|Sender|Date|Subject" }
>> /^$/ { inBody=1 }
>> inBody { print; next }
>> /^[^[:blank:]]+:$/ { sect = $1 }
>> sect ~ "^" keep ":"
>> ' file
>
> Thanks - that one ran.
>
> I saved the bit between ' and ' as script4.awk
>
> then did
>
> awk script4 inputfile >> outputfile
>
> It didn't remove all the unwanted bits, but did remove a lot of them. Thanks!

There's a bug in it. Try this instead:

awk '
BEGIN{ keep="From|Sender|Date|Subject" }
/^$/ { inBody=1 }
inBody { print; next }
/^[^[:blank:]]+:[[:blank:]]/ { sect = $1 }
sect ~ "^" keep ":"
' file

Regards,

Ed.
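
A side note on the last pattern, offered as a hedged sketch rather than a
correction of the script above: concatenation binds tighter than ~, so
"^" keep ":" becomes the single regexp ^From|Sender|Date|Subject:, in which
the ^ applies only to From and the colon only to Subject. Wrapping the
alternation in parentheses anchors every name at both ends. The two sample
header names below are made up for the demonstration.

```shell
#!/bin/sh
# Compare the unparenthesised and parenthesised forms of the keep-pattern.
demo=$(printf 'X-Sender: a\nSender: b\n' | awk '
    BEGIN { keep = "From|Sender|Date|Subject" }
    {
        loose  = ($1 ~ "^" keep ":")      # "^From" OR "Sender" OR "Date" OR "Subject:"
        strict = ($1 ~ "^(" keep "):")    # whole field name must be one of the four
        print $1, loose, strict
    }
')
printf '%s\n' "$demo"
# X-Sender: 1 0
# Sender: 1 1
```

With the loose form a header such as X-Sender: would be kept as well, which
may or may not matter for a given mailbox.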

Ed Morton

Sep 18, 2012, 9:24:51 AM
On 9/18/2012 12:26 AM, no.to...@gmail.com wrote:
> In article <k33ckf$srb$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:
<snip>
>> --------------------
>> $ cat tst.awk
>> BEGIN{ RS=ORS="\n<><><><>\n" }
>> NR==FNR { skip=$0; next } # FileDelete: singleRecord -> skip
>> found[skip]++ { sub(skip,"") }
>> { print }
>> --------------------
>> $ gawk -f tst.awk FileDelete FileIn
<snip>
> BTW I've got decades of files where I've
> used "<>...<>" of varying length
> to <draw a line under it>.
> So if "<>...<>" could be a Regex that would be good.

Change the RS to "\n(<>)+\n" as Janis suggested and set ORS to RT at the start
of each action block instead of in BEGIN to avoid that string getting added at
the end of the final record for those cases where no RS is present at the end of
the final record in your input file:

$ cat file
foo
<><><><>
bar
<><><><><>
zed
$ echo "---------------------"
---------------------
$ gawk 'BEGIN{RS=ORS="\n<><><><>\n"} {print "{#" $0 "#}" }' file
{#foo#}
<><><><>
{#bar
<><><><><>
zed
#}
<><><><>
$ echo "---------------------"
---------------------
$ gawk 'BEGIN{RS=ORS="\n(<>)+\n"} {print "{#" $0 "#}" }' file
{#foo#}
(<>)+
{#bar#}
(<>)+
{#zed
#}
(<>)+
$ echo "---------------------"
---------------------
$ gawk 'BEGIN{RS="\n(<>)+\n"} {ORS=RT; print "{#" $0 "#}" }' file
{#foo#}
<><><><>
{#bar#}
<><><><><>
{#zed
#}$

>
> ==> After further testing with DG's script, I've guessed that the
> problem is caused by the 2 special-chars: "[", "]"

Could be.

> --------------------- <- a differnt type of <under-line>
> => Let's try a 1-line-block, in 'FileDelete' :
> -> gawk -f atst.awk OneLine L | wc -l == 3110 Good.
>
> Testing with a deleteFile which contains no ][ problem-chars:-
> -> gawk -f atst.awk a8Lines L | wc -l 3060
> => 3112 - 3060 == 52 == 2*26
> ==> 2 26-line-blocks were deleted
> => ? how many deleteBlocks did the inFile contain?
> => select ONE line in the 26-line-block-PATTERN and grep-count:
> -> cat L | grep "15. http://www.iep.utm.edu/h/" | wc -l == 3
> => So the DeleteFile appears 3-times in the InFile, and should be
> deleted twice. Ie. 2*26 == 52 lines should be deleted,
> as appears. = good.
>
> -> gawk -f atst.awk 8Lines L | wc -l == 3112
> ============
>
> It's just a coincidence that squareBrackets enclosing digits
> should be ignored, for the matching, and they ALSO give a
> problem; but I suspect that OTHER problematic specialChars exist.

Yes, see the more complete script I posted which escapes all RE metacharacters.

> This little problem shows how difficult it is when there's a
> big gap between the designer and the tester.
>
> If the code is understood [via comments] by the tester, it
> can be adjusted accordingly.

The complete script I posted has several comments. I recommend we focus on that
from now on rather than trying to debug this initial small script that AFAIK
will never do what you really want.

Ed.
>
> Thanks,
>
> == Chris Glur.
>
>
>
>
>
>

Steve Hayes

Sep 18, 2012, 11:14:42 AM
The last one Ed posted ran OK.

Steve Hayes

Sep 18, 2012, 11:38:20 AM
On Mon, 17 Sep 2012 11:05:18 +0100, dave.gma...@googlemail.com.invalid
(Dave Gibson) wrote:


>----script begins on next line
>#! /usr/bin/awk -f
>
>BEGIN {
>
> # Should "to" be in this list?
> save_headers = "sender|from|date|subject|keywords"
>
> hpat = "^(" save_headers "):"
>}
>
> # First empty line (RFC822 mandates cr/lf eols) marks end of headers
> # Set flag to print everything from that point on.
>/^\r?$/ { body = 1 }
>
>body { print ; next }
>
> # hdr is set if an interesting header has been seen. This line may be
> # a continuation of that header.
>hdr {
> if ($0 ~ /^[[:blank:]]/) {
> print
> next
> }
> hdr = 0
>}
>
> # Header fieldnames are case insensitive
>tolower($0) ~ hpat {
> print
> hdr = 1
>}
>----script ends on previous line

Thanks for that.

When I tried it, I got this message:

awk line 26:function not defined tolower

Janis Papanagnou

Sep 18, 2012, 12:37:58 PM
Am 18.09.2012 17:38, schrieb Steve Hayes:
> On Mon, 17 Sep 2012 11:05:18 +0100, dave.gma...@googlemail.com.invalid
> (Dave Gibson) wrote:
> [...]
>> ----script ends on previous line
>
> Thanks for that.
>
> When I tried it, I got this message:
>
> awk line 26:function not defined tolower

You seem to be operating on Solaris? Then use /usr/xpg4/bin/awk instead.

Janis

Ed Morton

Sep 18, 2012, 2:28:10 PM
Steve Hayes <haye...@telkomsa.net> wrote:
<snip>
> When I tried it, I got this message:
>
> awk line 26:function not defined tolower


What OS are you running on? Also, please copy/paste the result of running
"awk --version" so we know for sure what awk you have. Whatever it is, it's
not GNU awk, so I'd highly recommend you install that and use it from now on.
It's available at http://www.gnu.org/software/gawk/.

Ed.

Posted using www.webuse.net

Steve Hayes

Sep 18, 2012, 3:18:01 PM
It's ancient -- Version 3.0 for DOS, but when I ran that I got something very
strange:

DOSPRN Print Spooler. Version 1.77
(c) 1990-2004 by Gurtjak D., Ignatenko I., Goldberg A.
Use extended memory: 200K
Use conventional memory: 4K

Kaz Kylheku

Sep 18, 2012, 4:06:11 PM
On 2012-09-15, unruh <un...@invalid.ca> wrote:
> On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:
>> On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:
>>
>>>We need these heuristics to trim out the repetative garbage
>>>that even the minimalist lynx/links fetches from http-pages.
>>
>> I'd be interested in knowing whether awk could trim unwanted lines from the
>> headers of saved e-mail messages, to remove all but senter, recipient, date,
>> subject and key words.
>
> Sure. I have no idea what the key for "keywords" is but for the rest
>
> awk ' BEGIN {H=1}
> H == 1 && ( $0 ~/^Subject:/ || $0 ~ /^To:/ || $0 ~ /^From:/) {print $0}
> H==1 && $0 ~ /^$/ {H=0}
> H == 0 {print $0}' nameoffile
>
> should do it.

That "should do it", indeed, if you write the program only for your own use and
you're prepared, each time you use it, to validate that it didn't do anything
wrong (e.g. review a diff -u between input and output).

Stored e-mails are probably in the mbox format, which you have to parse
properly. Each message begins with "From " and there is a mechanism by which
this is escaped if it occurs in a message body. Within a message, the headers
are separated from the body by a blank line.

An e-mail header can be folded onto multiple physical lines, like this:

Subject: foo
bar

which is the same as:

Subject: foo bar

See RFC 2822 or whatever the latest one is now.
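
The folding rule can be sketched in awk as follows (RFC 5322 is the document
that replaced RFC 2822). This is only an illustration under simple
assumptions: a single message on standard input, headers ending at the first
empty line, and a POSIX awk. The names unfold_headers and emit_held are
invented for the example.

```shell
#!/bin/sh
# Join folded header lines into one logical line each; copy the body as-is.
unfold_headers() {
    awk '
        function emit_held() { if (held != "") print held; held = "" }

        body    { print; next }                        # after the headers: copy verbatim
        /^\r?$/ { emit_held(); body = 1; print; next } # empty line ends the headers
        /^[ \t]/ && held != "" {                       # continuation of the held header
            sub(/\r$/, "", held)                       # drop a CR left on the previous part
            sub(/^[ \t]+/, " ")                        # leading blanks become one space
            held = held $0
            next
        }
        { emit_held(); held = $0 }                     # a new header starts here
        END { emit_held() }                            # message with no body at all
    '
}

unfolded=$(printf 'Subject: foo\n bar\nFrom: x\n\nbody\n  indented\n' | unfold_headers)
printf '%s\n' "$unfolded"
```

Once headers are unfolded, a line-by-line test such as /^Subject:/ sees the
whole field, and indented lines in the body are left alone.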

Dave Gibson

Sep 18, 2012, 6:54:42 PM
Steve Hayes <haye...@telkomsa.net> wrote:
> On Mon, 17 Sep 2012 11:05:18 +0100, dave.gma...@googlemail.com.invalid
> (Dave Gibson) wrote:

[ at this point, hpat == "^(sender|from|date|subject|keywords):" ]

>>tolower($0) ~ hpat {

> Thanks for that.
>
> When I tried it, I got this message:
>
> awk line 26:function not defined tolower

----script begins on next line
BEGIN {
sender = "[Ss][Ee][Nn][Dd][Ee][Rr]"
from = "[Ff][Rr][Oo][Mm]"
date = "[Dd][Aa][Tt][Ee]"
subject = "[Ss][Uu][Bb][Jj][Ee][Cc][Tt]"
keywords = "[Kk][Ee][Yy][Ww][Oo][Rr][Dd][Ss]"
hpat = "^(" sender "|" from "|" date "|" subject "|" keywords "):"
}

/^\r?$/ { body = 1 }

body { print ; next }

hdr {
# brackets on next line contain a space and a tab
if ($0 ~ /^[ ]/) {
print
next
}
hdr = 0
}

$0 ~ hpat {
print
hdr = 1
}
----script ends on previous line

Steve Hayes

Sep 19, 2012, 5:41:26 AM
On Tue, 18 Sep 2012 23:54:42 +0100, dave.gma...@googlemail.com.invalid
Thanks once again.

It worked fine on the first mail message, not so well on subsequent ones.
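
A likely explanation, going by the logic of the script: the body flag is set
at the first empty line and never cleared, so everything after the first
message's headers is printed unchanged. Here is a hedged sketch that resets
the state at each separator, assuming every message ends with a line starting
"-- End --" as in the sample below, and assuming an awk that has tolower()
(an awk without it would need the bracket-class form such as
[Ff][Rr][Oo][Mm]). The name trim_headers is invented for the example.

```shell
#!/bin/sh
# Keep only selected headers (and their folded continuations) in each
# message of a file whose messages are separated by "-- End --" lines.
trim_headers() {
    awk '
        /^-- End --/    { body = 0; hdr = 0; print; next }  # next message: header mode again
        body            { print; next }                     # message body: copy verbatim
        /^\r?$/         { body = 1; print; next }           # empty line ends the headers
        hdr && /^[ \t]/ { print; next }                     # continuation of a kept header
        { hdr = 0 }                                         # any other line starts a new header
        tolower($0) ~ /^(sender|from|date|subject|keywords):/ { print; hdr = 1 }
    '
}

trimmed=$(printf 'From: a\nX-Junk: 1\n\nbody1\n-- End --\nReceived: x\n by y\nSubject: s\n more\nX-Other: z\n\nbody2\n' | trim_headers)
printf '%s\n' "$trimmed"
```

A body that itself contains a line starting "-- End --" would be treated as a
message boundary, so this only suits files where that marker is reserved.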

Here are the first three in the test file. The messages are separated by
"-- End --"

From: "Steve Hayes" <haye...@telkomsa.net>
Date: Sat, 15 Sep 2012 11:33:15 +0200
Subject: [SOUTH-AFRICA] Morris, Stewardson, Huskisson & other families
Sender: south-afri...@rootsweb.com

In the Cape Archives there is a document (A610) by William Charles Titterton
on the history of his family, which includes the TITTERTON, MORRIS and
HUSKISSON families who lived in the Cape Colony and traded in what is now
Namibia, and also St Helena. .

We looked at the document when we were in the Western Cape on holiday in 2003
and made notes from it. We were hoping to find some link with our family,
because we knew from books that my wife's 3g grandfather STEWARDSON had
married a MORRIS, mainly because the missionary CH HAHN wrote in his diary of
the arrival from the Bay of STEWARDSON with "his brother-in-law, the Wesleyan
trader MORRIS, with whom he lives in fierce enmity". HAHN, a Lutheran, was
deeply suspicious of Wesleyans, whom he regarded as heretics.

No first names, so it's not very helpful, but perhaps the "fierce enmity" is
why there is no mention of the Stewardsons in the Titterton manuscript.

But now we have found that Mrs STEWARDSON was indeed Frances MORRIS, and the
sister of James MORRIS the trader, who is definitely recorded in the
manuscript (he was the grandfather of the author).

So now that we have established a definite link with these fairly well-
documented families, I wonder if anyone else here is researching them.

If anyone is interested, there's more on our blog here:
http://t.co/m7y9R6mK


--
Keep well,
Steve Hayes
Web: http://hayesgreene.wordpress.com
http://hayesfam.posterous.com/
E-mail: sha...@dunelm.org.uk



-------------------------------
To unsubscribe from the list, please send an email to
SOUTH-AFRI...@rootsweb.com with the word 'unsubscribe' without the
quotes in the subject and the body of the message

-- End --
Return-Path:
SRS0+3d0f3d40735f1c46=HM=rootsweb.com=south-afri...@netcommunity1.com
Received: from cordelia.lb1.telkomsa.net (LHLO cordelia.telkomsa.net)
(192.168.222.61) by mail6.telkomsa.net with LMTP; Thu, 13 Sep 2012 06:46:39
+0200 (SAST)
Received: from localhost (localhost.localdomain [127.0.0.1])
by cordelia.telkomsa.net (Postfix) with ESMTP id 8BD342A8097
for <online...@telkomsa.net>; Thu, 13 Sep 2012 06:46:39 +0200
(SAST)
X-Virus-Scanned: amavisd-new at telkomsa.net
Received: from cordelia.telkomsa.net ([127.0.0.1])
by localhost (cordelia.telkomsa.net [127.0.0.1]) (amavisd-new, port
10024)
with ESMTP id hOUZ8ShKi83f for <online...@telkomsa.net>;
Thu, 13 Sep 2012 06:46:39 +0200 (SAST)
Received: from as7.telkomsa.net (unknown [196.25.211.210])
by cordelia.telkomsa.net (Postfix) with ESMTP id 6FD492A80A6
for <haye...@telkomsa.net>; Thu, 13 Sep 2012 06:46:39 +0200 (SAST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result:
AtUAAL1jUVDNi2kLmWdsb2JhbABCA4YHokGSNnQiAQEBAQEICwsHFBcQgiIBBQEBBRsREggCIgoDAQICBAIEIAIiBAICAgEBLRUBFgEHCwUDCQwBA4dqAgaoc5MkgSGKB4MGghIyYAONY4ImlBmERIFh
Received: from forward2.blackbaudservices2.com ([205.139.105.11])
by as7.telkomsa.net with ESMTP; 13 Sep 2012 06:39:32 +0200
Received: from lists2.rootsweb.com ([66.43.28.157])
by forward2.blackbaudservices2.com (IceWarp 10.3.4) with ESMTP id
XNH17434
for <sha...@dunelm.org.uk>; Thu, 13 Sep 2012 00:46:34 -0400
Received: from lists2.rootsweb.com (lists2.rootsweb.com [127.0.0.1])
by lists2.rootsweb.com (8.13.8/8.13.8) with ESMTP id q8D4hIGe013895
for <sha...@dunelm.org.uk>; Wed, 12 Sep 2012 22:46:33 -0600
Received: from mail3.rootsweb.com (mail3.rootsweb.com [192.168.26.64])
by lists2.rootsweb.com (8.13.8/8.13.8) with ESMTP id q8D4hHcB013888;
Wed, 12 Sep 2012 22:43:17 -0600
Received: from mail.rootsweb.com (mail.rootsweb.com [192.168.26.52])
by mail3.rootsweb.com (8.13.8/8.13.8) with ESMTP id q8D4hHel032290;
Wed, 12 Sep 2012 22:43:17 -0600
Received: from cpt-ipcrelay08.mweb.co.za (cpt-ipcrelay08.mweb.co.za
[196.28.182.88])
by mail.rootsweb.com (8.13.8/8.13.8) with ESMTP id q8D4hFc9030511;
Wed, 12 Sep 2012 22:43:16 -0600
Received: from 41-133-42-223.dsl.mweb.co.za ([41.133.42.223] helo=AspirePC)
by cpt-ipcrelay08.mweb.co.za with smtp (Exim 4.77)
id 1TC1Gd-0009yi-2l; Thu, 13 Sep 2012 06:43:12 +0200
Message-ID: <A454DF57DBE64494A3D0563E9D7580F3@AspirePC>
From: "Carol Beneke" <ca...@iafrica.com>
To: "Rootsweb - E.C." <south-africa...@rootsweb.com>,
"Rootsweb - S.A." <south-...@rootsweb.com>
Date: Thu, 13 Sep 2012 06:43:06 +0200
MIME-Version: 1.0
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
X-Mailer: Microsoft Windows Live Mail 15.4.3555.308
X-MimeOLE: Produced By Microsoft MimeOLE V15.4.3555.308
X-Scanned-By: MIMEDefang 2.68 on 192.168.26.64
X-Content-Filtered-By: Mailman/MimeDel 2.1.7
Subject: [SOUTH-AFRICA] EASTERN PROVINCE HERALD, 15.03.1985
X-BeenThere: south-...@rootsweb.com
X-Mailman-Version: 2.1.7
Precedence: list
Reply-To: south-...@rootsweb.com
List-Id: <south-africa.rootsweb.com>
X-Loop: SOUTH-...@rootsweb.com
X-Member: SOUTH-...@rootsweb.com
List-Unsubscribe:
<http://lists2.rootsweb.ancestry.com/mailman/listinfo/south-africa>,
<mailto:south-afri...@rootsweb.com?subject=unsubscribe>
List-Archive:
<http://archiver.rootsweb.ancestry.com/th/index?list=south-africa>
List-Post: <mailto:south-...@rootsweb.com>
List-Help: <mailto:south-afri...@rootsweb.com?subject=help>
List-Subscribe:
<http://lists2.rootsweb.ancestry.com/mailman/listinfo/south-africa>,
<mailto:south-afri...@rootsweb.com?subject=subscribe>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Sender: south-afri...@rootsweb.com
Errors-To: s+-outh-africa-bounce+shayes=dunelm...@rootsweb.com
X-Spam-Status: No, hits=1.38 required=5.00
tests=SPF_PASS,RATWARE_RCVD_BONUS_SPC,SUBJ_ALL_CAPS,MR_NOT_ATTRIBUTED_IP,NO_RDNS2
version=3.2.5
X-Spam-Level: *
X-Spam-Checker-Version: SpamAssassin 3.2.5 (1.1) on
forward2.blackbaudservices2.com
X-Antivirus: avast! (VPS 120912-1, 2012-09-12), Inbound message
X-Antivirus-Status: Clean
X-PMFLAGS: 34095104 0 1 PEPSO39Z.CNM
X-AC-Weight: [####] (Whitelisted) -9999
X-CC-Diagnostic:

RUFTVEVSTiBQUk9WSU5DRSBIRVJBTEQuCgpQb3J0IEVsaXphYmV0aCwgRnJpZGF5LCBNYXJjaCAx
NSwgMTk4NS4KCgoKQklSVEhTLgoKTWNHSUxMRVdJRS4g4oCTIFRvIExhcnJ5IGFuZCBTaGFyb24s
IGEgc29uIGFuZCBicm90aGVyIGZvciBBbXksIGF0IEZyZXJlIEhvc3BpdGFsLCBFYXN0IExvbmRv
biBvbiBNYXJjaCAxMy4KClRFVUNIRVJULiDigJMgVG8gUGV0ZXIgYW5kIEFubmUtTWFyaWUsIGEg
c29uIGJvcm4gTWFyY2ggMTMsIGF0IFNhbmRmb3JkLgoKCgpERUFUSFMuCgpHUkVHT1JZLiDigJMg
V2luaWZyZWQsIHBhc3NlZCBhd2F5IGF0IE11bnJvIEtpcmssIE1hcmNoIDExLCAxOTg1LCBhZ2Vk
IDk0IHllYXJzLCBtb3RoZXIgb2YgRWR3YXJkLCBHYXkgYW5kIGdyYW5kY2hpbGRyZW4gTG91aXNl
LCBIb3dhcmQsIERhdmlkLCByZWxhdGlvbnMgUmFscGggVFdFRVQsIERvZmZpZSBhbmQgV2luIE1B
Q0tFTlpJRS4KCk1VUlJBWS4g4oCTIEVsbGEsIGZyaWVuZCBvZiBJc2FiZWwgTEFLRVksIHBhc3Nl
ZCBhd2F5IGluIEVhc3QgTG9uZG9uLCBXZWRuZXNkYXksIE1hcmNoIDEzLiBTeW1wYXRoeSB0byBo
ZXIgc2lzdGVyIExlc2xleSBhbmQgTG91aXMgVEhPTVNPTi4KCk9XRU4uIOKAkyBQYXQsIHdpZmUg
b2YgUGVyY3ksIG1vdGhlciBvZiBDb3JhbCwgTHlubiwgUGV0ZXIgYW5kIGZhbWlsaWVzLCBwYXNz
ZWQgYXdheSBNYXJjaCAxMi4KClBBSVpJUy4gLSBDb3N0YSwgZGllZCB0cmFnaWNhbGx5IGF0IEdl
b3JnZSwgYXQgdGhlIGFnZSBvZiA0MSBvbiBUdWVzZGF5LCBNYXJjaCAxMy4gTW91cm5lZCBieSBo
aXMgd2lmZSBBbm5hIGFuZCBjaGlsZHJlbiBBbmdpZSwgTGV4aWUsIFN0YXRoaSBhbmQgVGluYS4g
U29uIG9mIENvdWxhLCBicm90aGVyIG9mIERpbWl0cmksIEdheWxlLCBQZW5lbG9wZSwgTWl0Y2gg
YW5kIE1hcnktR2lsbCBhbmQgdW5jbGUgQ2hyaXMgYW5kIExpc2EuCgoKCkZVTkVSQUwgTk9USUNF
Uy4KCkdSRUdPUlkuIOKAkyBXaW5pZnJlZCwgcGFzc2VkIGF3YXkgYXQgTXVucm8gS2lyaywgTW9u
ZGF5LCBNYXJjaCAxMSwgMTk4NSwgYWdlZCA5NCB5ZWFycy4gU2VydmljZSBmcm9tIFN0IEpvaG7i
gJlzIEFuZ2xpY2FuIENodXJjaCwgV2FsbWVyLCBvbiBNb25kYXksIE1hcmNoIDE4LCBhdCAxMS4z
MCBhbS4gSW50ZXJtZW50IGF0IHRoZSBXYWxtZXIgQ2VtZXRlcnkuCgpLTk9FVFpFLiDigJMgT2Jl
ZCwgcGFzc2VkIGF3YXkgaW4gQ2FwZSBUb3duIG9uIE1hcmNoIDExLCAxOTg1LiBGdW5lcmFsIHNl
cnZpY2Ugd2lsbCB0YWtlIHBsYWNlIG9uIFNhdHVyZGF5LCBNYXJjaCAxNiwgMTk4NSBhdCAxMSBh
bSBmcm9tIHRoZSBELlIuIENodXJjaCwgTW9lZGVyZ2VtZWVudGUsIFVpdGVuaGFnZS4KCk1FSU5U
SklFUy4g4oCTIEthcmwsIHBhc3NlZCBhd2F5IG9uIE1hcmNoIDEzLCBhdCB0aGUgYWdlIG9mIDgx
IHllYXJzLiBTZXJ2aWNlIHRvZGF5LCBGcmlkYXksIE1hcmNoIDE1IGF0IHRoZSBQRSBDcmVtYXRv
cml1bSBhdCAzIHBtLgoKUEFJWklTLiDigJMgQ29zdGEsIGZ1bmVyYWwgYXQgMyBwbSBhdCB0aGUg
R3JlZWsgT3J0aG9kb3ggQ2h1cmNoLCBDb255bmdoYW0gU3RyZWV0LCBQYXJzb25zIEhpbGwuCgpW
QU4gQlVDSEVOUk9ERVIuIOKAkyBDaGFybGVzIEFsZHJpZGdlLCB0aGUgc2VydmljZSB3aWxsIHRh
a2UgcGxhY2UgYXQgdGhlIFNldmVudGggRGF5IEFkdmVudGlzdCBDaHVyY2gsIENvb21icyBSb2Fk
LCBCZXRoZWxzZG9ycCwgYXQgNCBwbSwgU2F0dXJkYXksIHRoZW5jZSB0byB0aGUgQmV0aGVsc2Rv
cnAgQ2VtZXRlcnkuCgoKCgpUcmFuc2NyaWJlZCBieQpDYXJvbCBCRU5FS0UgbmVlIFNURVdBUlQK
UG9ydCBFbGl6YWJldGgsIFNvdXRoIEFmcmljYS4KMDgzIDQ4MiAxNDgyCmNhcm1lQGlhZnJpY2Eu
Y29tCl9fX19fX19fX19fX19fX19fX19fX19fXwpSZXNlYXJjaGluZzogU1RFV0FSVCwgU1RPTkUs
IExVS0UsIE9MSVZJRVIsIEJFTkVLRSwgQkVORUNLRSwgVk9OIEJFTkVDS0UgCmFuZCByZWxhdGVk
IGZhbWlsaWVzLgoKRmFjZWJvb2sgZ3JvdXAgLSBCZW5la2UgQmVuZWNrZSB2b24gQmVuZWNrZSBp
biBTb3V0aCBBZnJpY2EuCiAKLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQpUbyB1bnN1
YnNjcmliZSBmcm9tIHRoZSBsaXN0LCBwbGVhc2Ugc2VuZCBhbiBlbWFpbCB0byBTT1VUSC1BRlJJ
Q0EtcmVxdWVzdEByb290c3dlYi5jb20gd2l0aCB0aGUgd29yZCAndW5zdWJzY3JpYmUnIHdpdGhv
dXQgdGhlIHF1b3RlcyBpbiB0aGUgc3ViamVjdCBhbmQgdGhlIGJvZHkgb2YgdGhlIG1lc3NhZ2U=

-- End --
Return-Path:
SRS0+4e66fb160e0b5875=HL=rootsweb.com=south-afri...@netcommunity1.com
Received: from juliet.lb1.telkomsa.net (LHLO juliet.telkomsa.net)
(192.168.222.58) by mail6.telkomsa.net with LMTP; Wed, 12 Sep 2012 06:58:42
+0200 (SAST)
Received: from as4.telkomsa.net (unknown [196.25.211.210])
by juliet.telkomsa.net (Postfix) with ESMTP id 50F282584E0
for <haye...@telkomsa.net>; Wed, 12 Sep 2012 06:58:42 +0200 (SAST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result:
AiABAAQVUFDNi2kLmWdsb2JhbABCA4YHohuSPXgiAQEBAQEICwsHFBcQgiIBBQEBBRsREggCLAMBAgIEAgQgAiIEAgICAQEtFQEWAQcLBQMJDASHagIGqF2TMYEhigeDBoISMmADjWOCJoV0i1uCSoREgWE
Received: from forward2.blackbaudservices2.com ([205.139.105.11])
by as4.telkomsa.net with ESMTP; 12 Sep 2012 06:48:48 +0200
Received: from lists2.rootsweb.com ([66.43.28.157])
by forward2.blackbaudservices2.com (IceWarp 10.3.4) with ESMTP id
WNS12936
for <sha...@dunelm.org.uk>; Wed, 12 Sep 2012 00:58:36 -0400
Received: from lists2.rootsweb.com (lists2.rootsweb.com [127.0.0.1])
by lists2.rootsweb.com (8.13.8/8.13.8) with ESMTP id q8C4uauq010945
for <sha...@dunelm.org.uk>; Tue, 11 Sep 2012 22:58:35 -0600
Received: from mail3.rootsweb.com (mail3.rootsweb.com [192.168.26.64])
by lists2.rootsweb.com (8.13.8/8.13.8) with ESMTP id q8C4rUXg009642;
Tue, 11 Sep 2012 22:53:30 -0600
Received: from mail.rootsweb.com (mail.rootsweb.com [192.168.26.52])
by mail3.rootsweb.com (8.13.8/8.13.8) with ESMTP id q8C4rTJP030824;
Tue, 11 Sep 2012 22:53:29 -0600
Received: from cpt-ipcrelay06.mweb.co.za (cpt-ipcrelay06.mweb.co.za
[196.28.182.86])
by mail.rootsweb.com (8.13.8/8.13.8) with ESMTP id q8C4rO2I011207;
Tue, 11 Sep 2012 22:53:28 -0600
Received: from 41-135-37-143.dsl.mweb.co.za ([41.135.37.143] helo=AspirePC)
by cpt-ipcrelay06.mweb.co.za with smtp (Exim 4.77)
id 1TBewt-000Mrz-MU; Wed, 12 Sep 2012 06:53:22 +0200
Message-ID: <ADCD1131D7714C25A6F90D1584606E2F@AspirePC>
From: "Carol Beneke" <ca...@iafrica.com>
To: "Rootsweb - E.C." <south-africa...@rootsweb.com>,
"Rootsweb - S.A." <south-...@rootsweb.com>
Date: Wed, 12 Sep 2012 06:53:14 +0200
MIME-Version: 1.0
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
X-Mailer: Microsoft Windows Live Mail 15.4.3555.308
X-MimeOLE: Produced By Microsoft MimeOLE V15.4.3555.308
X-Scanned-By: MIMEDefang 2.68 on 192.168.26.64
X-Content-Filtered-By: Mailman/MimeDel 2.1.7
Subject: [SOUTH-AFRICA] EASTERN PROVINCE HERALD, 25.02.1985
X-BeenThere: south-...@rootsweb.com
X-Mailman-Version: 2.1.7
Precedence: list
Reply-To: south-...@rootsweb.com
List-Id: <south-africa.rootsweb.com>
X-Loop: SOUTH-...@rootsweb.com
X-Member: SOUTH-...@rootsweb.com
List-Unsubscribe:
<http://lists2.rootsweb.ancestry.com/mailman/listinfo/south-africa>,
<mailto:south-afri...@rootsweb.com?subject=unsubscribe>
List-Archive:
<http://archiver.rootsweb.ancestry.com/th/index?list=south-africa>
List-Post: <mailto:south-...@rootsweb.com>
List-Help: <mailto:south-afri...@rootsweb.com?subject=help>
List-Subscribe:
<http://lists2.rootsweb.ancestry.com/mailman/listinfo/south-africa>,
<mailto:south-afri...@rootsweb.com?subject=subscribe>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Sender: south-afri...@rootsweb.com
Errors-To: s+-outh-africa-bounce+shayes=dunelm...@rootsweb.com
X-Spam-Status: No, hits=1.38 required=5.00
tests=SPF_PASS,RATWARE_RCVD_BONUS_SPC,SUBJ_ALL_CAPS,MR_NOT_ATTRIBUTED_IP,NO_RDNS2
version=3.2.5
X-Spam-Level: *
X-Spam-Checker-Version: SpamAssassin 3.2.5 (1.1) on
forward2.blackbaudservices2.com
X-Antivirus: avast! (VPS 120911-1, 2012-09-11), Inbound message
X-Antivirus-Status: Clean
X-PMFLAGS: 34095232 0 1 P68I3I39.CNM
X-AC-Weight: [####] (Whitelisted) -9999
X-CC-Diagnostic:

CgpFQVNURVJOIFBST1ZJTkNFIEhFUkFMRAoKUG9ydCBFbGl6YWJldGgsICBNb25kYXksIEZlYnJ1
YXJ5IDI1LCAxOTg1LgoKCgpCSVJUSFMuCgpEVSBQUkVFWi4g4oCTIFRvIEJvYmJ5IGFuZCBTYW5k
cmEsIGEgZGF1Z2h0ZXIgb24gRmVicnVhcnkgMjIsIHNpc3RlciBmb3IgRGFycnlsLCBBbnRob255
IGFuZCBNb25pcXVlIGFuZCBncmFuZGRhdWdodGVyIGZvciBQZXRlciBhbmQgTm9yZWVuIE1FTFZJ
TExFLiAKCkZVUlNURU5CRVJHLiDigJMgVG8gUm95IGFuZCBQYXR0eSwgYSBkYXVnaHRlciwgYm9y
biBvbiBGZWJydWFyeSAyMCwgMTk4NSwgYXQgU2FuZm9yZC4KCkdJTENIUklTVC4g4oCTIFRvIFNw
ZWNrIGFuZCBIZW5yaWV0dGUgKG5lZSBNQVJBSVMpLCBhIGRhdWdodGVyLCBTYW5kdG9uIENsaW5p
YywgSm9oYW5uZXNidXJnLCBGZWJydWFyeSAyMS4KCk1BVEhFU09OLiDigJMgVG8gRG9uIGFuZCBS
aWNreSwgYSBkYXVnaHRlciBhbmQgc2lzdGVyIGZvciBHYXJyaW4sIEZlYnJ1YXJ5IDIyLCBhdCBT
YW5mb3JkLgoKTUlMTFMuIOKAkyBUbyBDaHJpc3RvcGhlciBhbmQgQmFyYmFyYSAobmVlIFZBTiBE
RSBNRVJXRSksIGEgc29uIGJvcm4gRmVicnVhcnkgMjEsIDE5ODUuCgpQQURERVkuIOKAkyBUbyBC
cmlhbiBhbmQgSmVubnkgKG5lZSBCVVhUT04pLCBhIHNvbiwgU0VBTiwgYm9ybiBGZWJydWFyeSAy
NCwgMTk4NS4gIEdyYW5kc29uIGZvciBTaWQgYW5kIERhcGhuZSwgQnJpYW4gYW5kIE1vbGx5LgoK
CgpFTkdBR0VNRU5UUy4KCkZPUkQg4oCTIEJBTkZJRUxELiDigJMgTXVyaWVsIGFuZCBEdWRkYSBo
YXZlIHBsZWFzdXJlIGluIGFubm91bmNpbmcgdGhlIGVuZ2FnZW1lbnQgb2YgdGhlaXIgeW91bmdl
ciBzb24sIEplZmZyZXksIHRvIEx5bm4sIHNlY29uZCBlbGRlc3QgZGF1Z2h0ZXIgb2YgTm9lbCBh
bmQgSW5hIEJBTkZJRUxEIG9mIEdyYWhhbXN0b3duLgoKR09SRE9OIOKAkyBCRVJOU1RFSU4uIOKA
kyBTYWxseSBhbmQgRnJlZGEgb2YgVWl0ZW5oYWdlIGhhdmUgcGxlYXN1cmUgaW4gYW5ub3VuY2lu
ZyB0aGUgZW5nYWdlbWVudCBvZiB0aGVpciBlbGRlc3Qgc29uLCBEYXZpZCwgdG8gS2F0aHksIHlv
dW5nZXN0IGRhdWdodGVyIG9mIEFydGh1ciBhbmQgRGVuaXNlIG9mIENhcGUgVG93bi4KCk1FSUhV
SVpFTiDigJMgREFWSURTT04uIOKAkyBTdGV2ZSBhbmQgUGFtIGFyZSBoYXBweSB0byBhbm5vdW5j
ZSB0aGUgZW5nYWdlbWVudCBvZiB0aGVpciBkYXVnaHRlciBJbGFyZXIgdG8gR3JhaGFtLCBzb24g
b2YgTWFjIGFuZCBFdW5pY2UsIG9mIE1pbG5lcnRvbiwgQ2FwZS4KClJJQ0hBUkRTT04g4oCTIERB
VU5FWS4g4oCTIFRoZSBlbmdhZ2VtZW50IGlzIGFubm91bmNlZCBpbiBMb25kb24sIGJldHdlZW4g
QW50aG9ueSwgZWxkZXIgc29uIG9mIFNpciBMZXNsaWUgUklDSEFSRFNPTiBCQVJUIGFuZCBMYWR5
IFJJQ0hBUkRTT04sIE9sZCBWaW5leWFyZCwgQ29uc3RhbnRpYSwgQ2FwZSwgYW5kIEdpbGxpYW4s
IGVsZGVyIGRhdWdodGVyIG9mIE1yIGFuZCBNcnMgQW50b255IERBVU5FWSwgb2YgUGFkZGluZ3Rv
biwgU3lkbmV5LCBBdXN0cmFsaWEuCgoKCk1BUlJJQUdFUy4KCkFOREVSU09OIOKAkyBTTEVNRU5U
LiDigJMgTm9ybWFuIGFuZCBDaGFybWFpbmUgU0xFTUVOVCBvZiBBZGRvLCBhcmUgcGxlYXNlZCB0
byBhbm5vdW5jZSB0aGUgbWFycmlhZ2Ugb2YgdGhlaXIgb25seSBkYXVnaHRlciwgTWFyeSwgdG8g
TWFyaywgb25seSBzb24gVmFsIERBVklTIGFuZCB0aGUgbGF0ZSBXYWxseSBBTkRFUlNPTiwgb2Yg
UG9ydCBFbGl6YWJldGgsIG9uIEZlYnJ1YXJ5IDIzLCAxOTg1LgoKCgpERUFUSFMuCgpHQVJSRVRU
LiDigJMgQWxpc29uLCBwYXNzZWQgYXdheSBvbiBGcmlkYXksIEZlYnJ1YXJ5IDIyLiBIdXNiYW5k
IG9mIEVsYWluZSBhbmQgZmF0aGVyIG9mIEFpbGVlbiwgU2hpcmxleSwgRGF2aWQsIFBhdWxpbmUg
YW5kIEFubmUuCgpHQVRMRVkuIOKAkyBKdW5lIEF1ZHJleSwgcGFzc2VkIGF3YXkgb24gU2F0dXJk
YXksIEZlYnJ1YXJ5IDIzLiAgV2lmZSBvZiBMYXdyZW5jZSwgYW5kIG1vdGhlciBvZiBSb2JpbiBh
bmQgTWFyZ3Vlcml0ZSBHQVRMRVkgYW5kIGdyYW5ueSB0byBFbGl6YWJldGggYW5kIEhlbGVuLiAg
TWlzc2VkIGJ5IEpvaG4sIFdlbmEgYW5kIGNoaWxkcmVuLCBhcyB3ZWxsIGFzIGhlciBkYXVnaHRl
ciBBbnRoZWEgYW5kIGJyb3RoZXIgQmlsbHksIGZyaWVuZHMgU2hhdW4sIEphY2tpZSwgSmVubmlm
ZXIgYW5kIE/igJlORUlMTCBmYW1pbHksIHNpc3RlciBNYXJpYSBhbmQgQ3lyaWwuCgpIT1NLSU4u
IOKAkyBFcm5lc3QgcGFzc2VkIGF3YXkgRmVicnVhcnkgMjIuICBNaXNzZWQgYnkgaGlzIGJyb3Ro
ZXIgQ2xpZmZvcmQgYW5kIElkYSwgYW5kIENsaWZmb3JkLCBNYXJsZW5lIGFuZCBmYW1pbHkuCgpM
SUVCT1dJVFouIOKAkyBIeW1pZSwgcGFzc2VkIGF3YXkgb24gU3VuZGF5LCBGZWJydWFyeSAyNC4g
IE1pc3NlZCBieSBoaXMgd2lmZSwgSm9zZXBoaW5lLCBjaGlsZHJlbiBTb2xseSwgRmVsaWNpdHks
IExpb25lbCwgRGVuaXNlLCBKYW5ldCwgTWFydGluIGFuZCBncmFuZGNoaWxkcmVuLgoKTkVBTUUu
IOKAkyBTeWJpbCBDb25zdGFuY2UsIHBhc3NlZCBhd2F5IG9uIEZlYnJ1YXJ5IDIyLCAxOTg1LiBX
aWZlIG9mIFJleCAoVGlueSkgYW5kIGZhbWlsaWVzLgoKUEFSS0lOLiDigJMgTWFyeSwgcGFzc2Vk
IGF3YXkgb24gRmVicnVhcnkgMjQuIE1vdXJuZWQgYnkgaGVyIGNoaWxkcmVuLCBSb3kgYW5kIFZh
bCBhbmQgdGhlaXIgZmFtaWxpZXMuCgpQUklOQ0UuIOKAkyBDYXJvbGluZSAoQ2FycnkpIChuZWUg
VklTQUdJRSksIHBhc3NlZCBhd2F5IG9uIEZlYnJ1YXJ5IDIxLiAgTW91cm5lZCBieSBTYXJhaCBh
bmQgZGF1Z2h0ZXJzLCBhbmQgU2FsbHkgRFUgUExFU1NJUywgRGF2aWQsIEpvYW4sIGFuZCBjaGls
ZHJlbiAoQ2FwZSBUb3duKS4KClJFRVZFUy1NT09SRS4g4oCTIEluYSwgbW90aGVyIG9mIEZheWUs
IEhlYXRoZXIgYW5kIFZhbGVyaWUsIGdyYW5kbW90aGVyIG9mIFJpY2hhcmQsIEdyYWVtZSwgQ2Fy
b2xpbmUsIE1pY2hhZWwsIFBlbm55LCBHaWxsaWFuLCBTdXNhbiBhbmQgSm9uYXRoYW4sIHBhc3Nl
ZCBhd2F5IEZlYnJ1YXJ5IDI0LgoKV0FUS0lOUy4g4oCTIEFsZiwgc2FkbHkgbWlzc2VkIGJ5IENs
aXZlLCBKZWFuLCBTaGFuZSwgUmVlcywgS2VyeSwgU2ltb24sIEx5bmxleSwgSmFuaWNlLCBHcmFu
dCwgVHJhY3kgYW5kIEJyYWRsZXkgYW5kIGFsbCBhdCBFUEJVQS4KCldJTFNPTi4g4oCTIEdlb3Jn
ZSwgaHVzYmFuZCBvZiBCcmVuZGEgYW5kIGZhdGhlciBvZiBKZW5uaWZlciwgU3VzYW4gYW5kIENh
cm9sLCBwYXNzZWQgYXdheSBpbiBHcmFoYW1zdG93biBvbiBGZWJydWFyeSwgMjQsIGluIEdyYWhh
bXN0b3duLgoKCgpGVU5FUkFMIE5PVElDRVMuCgpCUkFETEVZLiDigJMgU3VzYW5uYSBSb2VsZmlu
YSAoRklFTklFKSBwYXNzZWQgYXdheSBGZWJydWFyeSAyMywgMTk4NS4gRnVuZXJhbCBzZXJ2aWNl
IHdpbGwgdGFrZSBwbGFjZSBXZWRuZXNkYXksIEZlYnJ1YXJ5IDI3LCBSZWZvcm1lZCBDaHVyY2gs
IDR0aCBBdmVudWUsIFdhbG1lci4KCkdBUlJFVFQuIOKAkyBSZXYgQWxpc29uIEVkd2FyZCBGb3Ji
ZXMsIGFnZWQgODQgeWVhcnMsIHBhc3NlZCBhd2F5IGF0IHRoZSBQcm92aW5jaWFsIEhvc3BpdGFs
IG9uIEZlYnJ1YXJ5IDIyLiAgU2VydmljZSBpbiBTdCBKb2hu4oCZcyBNZXRob2Rpc3QgQ2h1cmNo
LCBIYXZlbG9jayBTdHJlZXQsIG9uIE1vbmRheSwgRmVicnVhcnkgMjUsIGF0IDEyIG5vb24uIENy
ZW1hdGlvbiBwcml2YXRlLiAgCgpIT1NLSU5HLiDigJMgRXJuZXN0LCBwYXNzZWQgYXdheSBhdCB0
aGUgUHJvdmluY2lhbCBIb3NwaXRhbCBVaXRlbmhhZ2Ugb24gRmVicnVhcnkgMjIsIDE5ODUuIEZ1
bmVyYWwgc2VydmljZSB3aWxsIHRha2UgcGxhY2UgYXQgdGhlIEJhcHRpc3QgQ2h1cmNoLCBEb2Rk
IFJvYWQsIFVpdGVuaGFnZSwgdG9tb3Jyb3csIFR1ZXNkYXksIEZlYnJ1YXJ5IDI1LCBhdCAxMSBh
bS4KCk1PTExFUi4g4oCTIEJhc2lsLCBwYXNzZWQgYXdheSBhdCB0aGUgUHJvdmluY2lhbCBIb3Nw
aXRhbCBvbiBTYXR1cmRheSBhdCB0aGUgYWdlIG9mIDc4IHllYXJzLiAgU2VydmljZSBpbiBTdCBT
YXZpb3VycyBBbmdsaWNhbiBDaHVyY2gsIFdhbG1lciBvbiBUdWVzZGF5LCBGZWJydWFyeSAyNiwg
YXQgMTAgYW0uIENyZW1hdGlvbiBwcml2YXRlLgoKUEFSS0lOLiDigJMgTWFyeSwgcGFzc2VkIGF3
YXkgb24gU3VuZGF5LCBGZWJydWFyeSAyNCwgYXQgdGhlIGFnZSBvZiA4NS4gU2VydmljZSBpbiBT
dCBKb2hu4oCZcyBBbmdsaWNhbiBDaHVyY2gsIFdhbG1lciwgdG9tb3Jyb3csIFR1ZXNkYXksIEZl
YnJ1YXJ5IDI2LCBhdCAxMS4zMCBhbS4gQ3JlbWF0aW9uIHByaXZhdGUuCgpQUklOQ0UuIOKAkyBD
YXJvbGluZSBGcmFuY2VzIChDYXJyeSkgKG5lZSBWSVNBR0lFKSwgcGFzc2VkIGF3YXkgb24gVGh1
cnNkYXksIEZlYnJ1YXJ5IDIxIGF0IHRoZSBhZ2Ugb2YgNTAsIHdpZmUgb2YgSm9obiwgbW90aGVy
IG9mIENhcm9sLCBKb2xlbmUsIE1hdXJlZW4gYW5kIEZpb25hLCBzb25zLWluLWxhdyBNZWx2aW4g
YW5kIFJldWJlbiwgIGdyYW5kbW90aGVyIG9mIExsb3lkLCBMdWNpbmRhIGFuZCBMaWV6ZWxsZS4g
RnVuZXJhbCBvbiBNb25kYXksIEZlYnJ1YXJ5IDI1LCAxOTg1LCBhdCB0aGUgTmV3IEFwb3N0b2xp
YyBDaHVyY2gsIEJlbGwgUm9hZCwgR2VsdmFuZGFsZSwgYXQgMyBwbS4gVGhlbmNlIHRvIFBhYXBl
bmt1aWxzIENlbWV0ZXJ5LgoKUFVMTEVZLiAtIEpvaG4sIHBhc3NlZCBhd2F5IG9uIFRodXJzZGF5
LCwgYXQgdGhlIGFnZSBvZiA2OS4gU2VydmljZSBpbiBOYXphcmV0aCBIb3VzZSBDaGFwZWwsIHRv
bW9ycm93LCBUdWVzZGF5LCBGZWJydWFyeSAyNiwgYXQgMTAgYW0uCgpWQU4gQlJVR0dFTi4g4oCT
IFdpbGxlbSBKYW4sIHZhbiBEZWxhcmV5c3RyYWF0IDEzLCBEZXNwYXRjaCwgc2FnIGhlZW5nZWdh
YW4uIEJlZ3JhZm5pcyB2YW5kYWcsIE1hYW5kYWcgMjUgRmVicnVhcmllLCAxOTg1LCBvbSAzbm0g
dmFudWl0IE4uRy4gS2VyaywgRWVuZHJhZywgRGVzcGF0Y2guCgpWRVJNQUFLLiDigJMgSm9oYW5u
YSBFbGl6YWJldGggdmFuIFN0cmFhdGVuLCBwYXNzZWQgYXdheSBvbiBGZWJydWFyeSAyMCwgMTk4
NS4gRnVuZXJhbCBzZXJ2aWNlIHdpbGwgdGFrZSBwbGFjZSB0b21vcnJvdywgVHVlc2RheSwgRmVi
cnVhcnkgMjYsIDE5ODUsIGF0IDIuMzAgcG0gZnJvbSB0aGUgRC5SLiBDaHVyY2gsIE1vd2JyYXkg
U3RyZWV0LCBOZXd0b24gUGFyay4KCldBVEtJTlMuIOKAkyBBbGYsIHBhc3NlZCBhd2F5IGF0IHRo
ZSBQcm92aW5jaWFsIEhvc3BpdGFsIG9uIFRodXJzZGF5IGF0IHRoZSBhZ2Ugb2YgNzkuICBTZXJ2
aWNlIGluIFN0IEh1Z2jigJlzIEFuZ2xpY2FuIENodXJjaCwgTmV3dG9uIFBhcmsgb24gTW9uZGF5
LCBGZWJydWFyeSAyNSBhdCAxMS4zMCBhbS4gQ3JlbWF0aW9uIHByaXZhdGUuCgoKCklOIE1FTU9S
SUFNLgoKQlJBRFNIQVcuIOKAkyBKb2huIEh1Z2gsIHBhc3NlZCBhd2F5IDIgeWVhcnMgYWdvLiBS
ZW1lbWJlcmVkIGJ5IFBhdGllbmNlLCBKb2huLCBJc2FiZWxsZSwgTWVsdmlsbGUsIEplYW5ldHRl
LCBTdHVhcnQsIE5pY2hvbGFzIGFuZCBMeW5sZXkuCgpLSVZFVFRTLiDigJMgUC5GLiwgcGFzc2Vk
IGF3YXkgRmVicnVhcnkgMjUsIDE5ODQuIFJlbWVtYmVyZWQgYnkgaGlzIHdpZmUgQ2F0aCBhbmQg
Y2hpbGRyZW4gYWxzbyBEYW5pZWwsIE1hdXJlZW4sIGFuZCBjaGlsZHJlbiwgQ2FwZSBUb3duLgoK
TEVWRVkuIC0gUGVnZ3ksIHBhc3NlZCBhd2F5IEZlYnJ1YXJ5IDI1LCAxOTgyLiBSZW1lbWJlcmVk
IGJ5IFN5ZG5leSwgUGV0ZXIgYW5kIEx5bm4sIEplZmZyZXkgYW5kIFZpY2t5LCBTdWUgYW5kIFJp
Y2hhcmQuCgpP4oCZS0VOTkVEWS4gSGF6ZWwsIHBhc3NlZCBhd2F5IE1hcmNoIDI1LCAxOTgzLiAg
UmVtZW1iZXJlZCBieSBoZXIgaHVzYmFuZCBFcm5lc3QgKE8uSy4pLCBjaGlsZHJlbiBKYW1lcywg
SmVubnksIEtlaXRoIGFuZCBSb3MsIGdyYW5kY2hpbGRyZW4gS2ltLCBXYXJyaWNrLCBOaWtraSwg
Umlja3kgYW5kIExlaWdoLgoKCgpUcmFuc2NyaWJlZCBieQpDYXJvbCBCRU5FS0UgbmVlIFNURVdB
UlQKUG9ydCBFbGl6YWJldGgsIFNvdXRoIEFmcmljYS4KMDgzIDQ4MiAxNDgyCmNhcm1lQGlhZnJp
Y2EuY29tCl9fX19fX19fX19fX19fX19fX19fX19fXwpSZXNlYXJjaGluZzogU1RFV0FSVCwgU1RP
TkUsIExVS0UsIE9MSVZJRVIsIEJFTkVLRSwgQkVORUNLRSwgVk9OIEJFTkVDS0UgCmFuZCByZWxh
dGVkIGZhbWlsaWVzLgoKRmFjZWJvb2sgZ3JvdXAgLSBCZW5la2UgQmVuZWNrZSB2b24gQmVuZWNr
ZSBpbiBTb3V0aCBBZnJpY2EuCiAKLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQpUbyB1
bnN1YnNjcmliZSBmcm9tIHRoZSBsaXN0LCBwbGVhc2Ugc2VuZCBhbiBlbWFpbCB0byBTT1VUSC1B
RlJJQ0EtcmVxdWVzdEByb290c3dlYi5jb20gd2l0aCB0aGUgd29yZCAndW5zdWJzY3JpYmUnIHdp
dGhvdXQgdGhlIHF1b3RlcyBpbiB0aGUgc3ViamVjdCBhbmQgdGhlIGJvZHkgb2YgdGhlIG1lc3Nh
Z2U=

-- End --


I'm not sure what that X-CC-Diagnostic thingy is. It seems big.

Dave Gibson

Sep 19, 2012, 7:56:13 AM
Steve Hayes <haye...@telkomsa.net> wrote:
> On Tue, 18 Sep 2012 23:54:42 +0100, dave.gma...@googlemail.com.invalid
> (Dave Gibson) wrote:
>
>>----script begins on next line
>>BEGIN {
>> sender = "[Ss][Ee][Nn][Dd][Ee][Rr]"
>> from = "[Ff][Rr][Oo][Mm]"
>> date = "[Dd][Aa][Tt][Ee]"
>> subject = "[Ss][Uu][Bb][Jj][Ee][Cc][Tt]"
>> keywords = "[Kk][Ee][Yy][Ww][Oo][Rr][Dd][Ss]"
>> hpat = "^(" sender "|" from "|" date "|" subject "|" keywords "):"
>>}

# Insert the following line here
/^-- End --/ { print ; body = 0 ; next }

>>
>>/^\r?$/ { body = 1 }
>>
>>body { print ; next }
>>
>>hdr {
>> # the bracket expression matches a space or a tab (written as \t)
>> if ($0 ~ /^[ \t]/) {
>> print
>> next
>> }
>> hdr = 0
>>}
>>
>>$0 ~ hpat {
>> print
>> hdr = 1
>>}
>>----script ends on previous line
>
> Thanks once again.
>
> It worked fine on the first mail message, not so well on subsequent
> ones.
>
> Here are the first three in the test file. The messages are separated
> by "-- End --"

The above edit assumes that there are no empty lines between the
"-- End --" and the first header of the next message. If that is wrong,
insert the following line instead:

/^[Rr][Ee][Tt][Uu][Rr][Nn]-[Pp][Aa][Tt][Hh]:/ { body = 0 }

> -- End --
> Return-Path:
[...]
> Content-Type: text/plain; charset="utf-8"
> Content-Transfer-Encoding: base64
[...]
> X-CC-Diagnostic:
>
> RUFTVEVSTiBQUk9WSU5DRSBIRVJBTEQuCgpQb3J0IEVsaXphYmV0aCwgRnJpZGF5LCBNYXJjaCAx
[...]
> I'm not sure what that X-CC-Diagnostic thingy is. It seems big.

The gibberish is the message body encoded as base64 -- it's not associated
with a specific header.
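
This can be checked by decoding the start of the encoded line quoted above. A minimal sketch, assuming a Unix-like shell with the `base64` utility from GNU coreutils (an assumption: it is not part of awk, and some systems spell the option `-D`); the function name `decode_sample` is made up for illustration:

```shell
# Decode the first 32 characters of the base64 body quoted above.
# Assumes the GNU coreutils base64 command is installed.
decode_sample() {
    printf '%s' 'RUFTVEVSTiBQUk9WSU5DRSBIRVJBTEQu' | base64 -d
}

decode_sample   # prints: EASTERN PROVINCE HERALD.
echo
```

The same command applied to the whole body (everything after the empty line that ends the headers) should give back the readable text of the message.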

unruh

Sep 19, 2012, 1:11:08 PM
On 2012-09-18, Kaz Kylheku <k...@kylheku.com> wrote:
> On 2012-09-15, unruh <un...@invalid.ca> wrote:
>> On 2012-09-15, Steve Hayes <haye...@telkomsa.net> wrote:
>>> On Mon, 10 Sep 2012 16:46:38 +0000 (UTC), no.to...@gmail.com wrote:
>>>
>>>>We need these heuristics to trim out the repetitive garbage
>>>>that even the minimalist lynx/links fetches from http-pages.
>>>
>>> I'd be interested in knowing whether awk could trim unwanted lines from the
>>> headers of saved e-mail messages, to remove all but sender, recipient, date,
>>> subject and key words.
>>
>> Sure. I have no idea what the key for "keywords" is but for the rest
>>
>> awk ' BEGIN {H=1}
>> H == 1 && ( $0 ~/^Subject:/ || $0 ~ /^To:/ || $0 ~ /^From:/) {print $0}
>> H==1 && $0 ~ /^$/ {H=0}
>> H == 0 {print $0}' nameoffile
>>
>> should do it.
>
> That "should do it", indeed, if you write the program only for your own use and
> you're prepared, each time you use it, to validate that it didn't do anything
> wrong (e.g. review a diff -u between input and output).

Yes, my example was meant as a very quick and dirty program to
demonstrate that "awk could do it".

It handles a single email in a file, not a whole sequence of them.
Otherwise, I guess one could add
$0~/^From / {H=1}
to the list of "conditions/responses", to reset to header mode whenever
a line beginning with "From " is seen.


Capturing the full lines of the headers even if they stretched over more
than one line is more difficult and I am sure not going to spend time
thinking about it since the OP never said why he wanted this, or whether
it was more than simply a passing curiosity to him. You are welcome to
do it if you care to.

Steve Hayes

Sep 19, 2012, 3:37:47 PM
On Wed, 19 Sep 2012 17:11:08 GMT, unruh <un...@invalid.ca> wrote:

>Capturing the full lines of the headers even if they stretched over more
>than one line is more difficult and I am sure not going to spend time
>thinking about it since the OP never said why he wanted this, or whether
>it was more than simply a passing curiosity to him. You are welcome to
>do it if you care to.

I've been wanting to do something like this for 20 years, and when I saw AWK
and its description I thought it might be able to do something like this, but
I didn't see how.

When someone asked if awk could perform a somewhat similar task, and there
appeared to be some awk fundis who knew how to make the thing work, I then
asked if it could do what I wanted it to do - in other words remove extraneous
headers from saved e-mail messages, which would make it easier to import them
into a database.

As I said elsewhere, in spite of having a version of awk lurking on my
computer for 20 years or so, I've never known how to use it, and I'm a
complete novice, but I hope to learn something from those who do know how to
use it.

Steve Hayes

Sep 19, 2012, 3:46:47 PM
On Wed, 19 Sep 2012 12:56:13 +0100, dave.gma...@googlemail.com.invalid
(Dave Gibson) wrote:

> # Insert the following line here
>/^-- End --/ { print ; body = 0 ; next }

Brilliant, thanks.

>The gibberish is the message body encoded as base64 -- it's not associated
>with a specific header.

Ah, yes, with that addition I can see that.

Ed Morton

Sep 19, 2012, 4:26:06 PM
Steve Hayes <haye...@telkomsa.net> wrote:

> On Tue, 18 Sep 2012 18:28:10 GMT, "Ed Morton" <morto...@gmail.com> wrote:
>
> >Steve Hayes <haye...@telkomsa.net> wrote:
> ><snip>
> >> When I tried it, I got this message:
> >>
> >> awk line 26:function not defined tolower
> >
> >
> >What OS are you running on? Also, please copy/paste the result of
> >running "awk
> >--version" so we know for sure what awk you have. Whatever it is, it's
> >not GNU
> >awk so I'd highly recommend you install that and use it from now on. It's
> >available at http://www.gnu.org/software/gawk/.
>
> It's ancient -- Version 3.0 for DOS, but when I ran that I got something very
> strange:
>
> DOSPRN Print Spooler. Version 1.77
> (c) 1990-2004 by Gurtjak D., Ignatenko I., Goldberg A.
> Use extended memory: 200K
> Use conventional memory: 4K

Strange indeed! Can you install gawk? If tolower() and reasonable support for
"--version" are missing there's no telling what else might be less than ideal
about that awk version and gawk provides a lot of VERY useful additional
functionality.

Ed.


Posted using www.webuse.net

Steve Hayes

Sep 19, 2012, 4:37:57 PM
On Wed, 19 Sep 2012 12:56:13 +0100, dave.gma...@googlemail.com.invalid
(Dave Gibson) wrote:

>Steve Hayes <haye...@telkomsa.net> wrote:
>> I'm not sure what that X-CC-Diagnostic thingy is. It seems big.
>
>The gibberish is the message body encoded as base64 -- it's not associated
>with a specific header.

I've just been checking some of the messages I've been trying to save.

These ones are hard to read and save:

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

These are not quite as hard to read or save, but still cause some problems:

Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

These ones are easy to read and save:

Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

The ones that are hardest to read and save appear to be produced by Windows
Live Mail.

Perhaps one could tell awk to delete such messages. Would it also be able to
convert "quoted printable" into something more readable?

The Natural Philosopher

Sep 19, 2012, 5:54:32 PM
Steve Hayes wrote:
> On Wed, 19 Sep 2012 17:11:08 GMT, unruh <un...@invalid.ca> wrote:
>
>> Capturing the full lines of the headers even if they stretched over more
>> than one line is more difficult and I am sure not going to spend time
>> thinking about it since the OP never said why he wanted this, or whether
>> it was more than simply a passing curiosity to him. You are welcome to
>> do it if you care to.
>
> I've been wanting to do something like this for 20 years, and when I saw AWK
> and its description I thought it might be able to do something like this, but
> I didn't see how.
>
> When someone asked if awk could perform a somewhat similar task, and there
> appeared to be some awk fundis who knew how to make the thing work, I then
> asked if it could do what I wanted it to do - in other words remove extraneous
> headers from saved e-mail messages, which would make it easier to import them
> into a database.
>
> As I said elsewhere, in spite of having a version of awk lurking on my
> computer for 20 years or so, I've never known how to use it, and I'm a
> complete novice, but I hope to learn something from those who do know how to
> use it.
>
>
I wrote some stuff in awk once. I've even got the manual. It took me
longer to get it working than the replacement which I wrote in C....and
ran a lot slower.




--
Ineptocracy

(in-ep-toc’-ra-cy) – a system of government where the least capable to
lead are elected by the least capable of producing, and where the
members of society least likely to sustain themselves or succeed, are
rewarded with goods and services paid for by the confiscated wealth of a
diminishing number of producers.

unruh

Sep 19, 2012, 8:07:13 PM
On 2012-09-19, Steve Hayes <haye...@telkomsa.net> wrote:
> On Wed, 19 Sep 2012 17:11:08 GMT, unruh <un...@invalid.ca> wrote:
>
>>Capturing the full lines of the headers even if they stretched over more
>>than one line is more difficult and I am sure not going to spend time
>>thinking about it since the OP never said why he wanted this, or whether
>>it was more than simply a passing curiosity to him. You are welcome to
>>do it if you care to.
>
> I've been wanting to do something like this for 20 years, and when I saw AWK
> and its description I thought it might be able to do something like this, but
> I didn't see how.

Well, now you have seen how. But the chance that someone is going to
write the program for you is small. The headers are not the space
problem in saving emails. The body is. Even a very large header is going
to be less than 1K, whereas the body these days is more like 1M or more.
So it is a pretty silly task (straining at the gnats while ignoring the
camel) to reduce the headers.

>
> When someone asked if awk could perform a somewhat similar task, and there
> appeared to be some awk fundis who knew how to make the thing work, I then
> asked if it could do what I wanted it to do - in other words remove extraneous
> headers from saved e-mail messages, which would make it easier to import them
> into a database.
>
> As I said elsewhere, in spite of having a version of awk lurking on my
> computer for 20 years or so, I've never known how to use it, and I'm a
> complete novice, but I hope to learn something from those who do know how to
> use it.

Buy the Awk book and read it.

>
>

Steve Hayes

Sep 20, 2012, 3:48:57 AM
On Thu, 20 Sep 2012 00:07:13 GMT, unruh <un...@invalid.ca> wrote:

>On 2012-09-19, Steve Hayes <haye...@telkomsa.net> wrote:
>> On Wed, 19 Sep 2012 17:11:08 GMT, unruh <un...@invalid.ca> wrote:

>Well, now you have seen how. But the chance that someone is going to
>write the program for you is small. The headers are not the space
>problem in saving emails. The body is. Even a very large header is going
>to be less than 1K, whereas the body these days is more like 1M or more.
>So it is a pretty silly task (straining at the gnats while ignoring the
>camel) to reduce the headers.

It's not a space problem, it's a readability problem. Ten years after a
message has been sent, the routing information etc will be of little interest.

>
>>
>> When someone asked if awk could perform a somewhat similar task, and there
>> appeared to be some awk fundis who knew how to make the thing work, I then
>> asked if it could do what I wanted it to do - in other words remove extraneous
>> headers from saved e-mail messages, which would make it easier to import them
>> into a database.
>>
>> As I said elsewhere, in spite of having a version of awk lurking on my
>> computer for 20 years or so, I've never known how to use it, and I'm a
>> complete novice, but I hope to learn something from those who do know how to
>> use it.
>
>Buy the Awk book and read it.

That's quite an expensive exercise if it turns out that awk is, after all, not
suitable for the task.

I'm glad that not all awk users are as rude and unhelpful as you.

[follow ups set]

no.to...@gmail.com

Sep 20, 2012, 8:00:53 AM
In article <pspci9x...@perseus.wenlock-data.co.uk>, dave.gma...@googlemail.com.invalid (Dave Gibson) wrote:

> In comp.lang.awk, no.to...@gmail.com wrote:
> > In article <k2v9om$l3s$1...@dont-email.me>, Ed Morton <morto...@gmail.com> wrote:
> > --snip --

Dave Gibson's current script is:

> awk -f the_following_script FileDelete FileIn > result
>
> ----script begins on next line
> #! /usr/bin/awk -f
>
> function flush_buffer(discard, n) {
>   if (!discard)
>     for (n = 1; n <= bufpos; n++)
>       print buffer[n]
>   bufpos = 0                     # bufpos is global; n is the only local variable
> }
>
> NR == FNR {
>   delete_list[++delmax] = $0     # first file read -> delete_list array
>   next
> }
> # after the first file is done, while reading the second file:
> $0 ~ delete_list[bufpos + 1] {   # if the current line matches the next delete line (as a regex)
>   buffer[++bufpos] = $0
>   if (bufpos >= delmax)
>     flush_buffer(blocks_seen++)
>   next
> }
>
> bufpos {
>   flush_buffer(0)
> }
>
> { print }
>
> END {
>   flush_buffer(0)
> }
> ----script ends on previous line

This non trivial task is part of a family-of-tasks that
we need to do all the time: clean out redundant/repeated stuff.

I've previously got some very useful scripts from USEnet
collaborations, like:
list all files in dir-tree $1
which are less than N days old,
and which contain String1,
...
and which contain StringN

I use those scripts every day, and THIS one will be valuable too.
So, I've made a test-script, to help beta-test the versions.

The task is to delete repeated/redundant blocks of lines
[although as Ed Morton pointed out, awk is not limited to lines]
from text files.

Input files have the format of: Ha|Hb|Hc|Hd|...Hn|
where H is the repeated/redundant block of text,
and a,b...n is the valuable text, to be kept,
and | represents a one-line-section-separator:
typically "<><><><>"

My test script, which assembles the a, b, ... and H blocks into the Infile
using the human-edited H block, gave the results below; they are hard to
follow and can probably be skipped.

The test conclusions, so far, are that:
if chars "(", "]", "[" are in the DeleteFile: H,
this gives problems.

Are these special-chars for `bash` ?

Thanks,

== Chris Glur.

Copy existing files for: a, b, c, d, e
use simple 1-char: H

-> ./BuildI == construct FileIn from parts:
len-I = 713

-> cp H R

-> TstDG ==
len Infile = 713
len DeleteFile = 1
len ouTfile = 687
==> 713 - 687 == 26 <-- expect 713 - 4 == 709
===> perhaps extra 'H' files were found.
====> test with unusual type of 'H' file

-> echo qzxv >> H
-> ./BuildI == construct FileIn from parts:
len-I = 718
-> cp H R
-> TstDG ==
len Infile = 718
len DeleteFile = 2
len ouTfile = 710
==> 718 - 710 == 8 == 2*4 == OK

====> now use a big random: H
-> ./BuildI == construct FileIn from parts:
len-I = 1538
-> cp H R
-> ./TstDG ==
len Infile = 1538
len DeleteFile = 166
len ouTfile = 1538
==> suspect <special chars> in R

====> combine
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 1538
len Infile = 1538
len DeleteFile = 166
len ouTfile = 1538
==> as expected/confirmed

==> keep problematic 'H' as Horg & edit out suspected line/s
==> Let 'H'contain NO-square-brackets.
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 763
len Infile = 763
len DeleteFile = 11
len ouTfile = 708
==> 763 - 708 == 55 == 5*11
===> suspect that now-reduced: H appears 6-times in FileIn
==> Yes, with difficulty, an editor confirms: H appears 6-times

-> cp H Horg2
=> edit/select 'H' to contain line:" [44]Gravatar"
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 838
len Infile = 838
len DeleteFile = 26
len ouTfile = 838
=> confirm problem with chars: "[","]"
=?=> which one or both: == deleting BOTH still FAILS
=> iteratively delete-1st-line of H until notFAIL
--------------- one line deleted between tests ---------------
construct FileIn from parts:
len-I = 793
len Infile = 793
len DeleteFile = 17
len ouTfile = 793
bash-3.1# ./BuildNtest
construct FileIn from parts:
len-I = 783
len Infile = 783
len DeleteFile = 15
len ouTfile = 723
---------------------------------------------------------
=?=!=> the line that caused the FAIL:-
Name (required)

==> adding an un-matched "(" 10 lines before end of 'H' causes:
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 783
awk: DGscript:15: (FILENAME=I FNR=7) fatal: Unmatched ( or \(: / * (/
len Infile = 783
len DeleteFile = 15
len ouTfile = 0

==> remove "(" & test "]"
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 783
len Infile = 783
len DeleteFile = 15
len ouTfile = 783

=> replace with "[]"
-> DGscript ==
:15: (FILENAME=I FNR=7) fatal: Unmatched [ or [^: / * []/
len Infile = 783
len DeleteFile = 15
len ouTfile = 0

====================
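
The characters "(", "[" and "]" are not special to bash here. The script compares lines with awk's `~` operator, so each line of the delete file is treated as a regular expression, in which parentheses form a group and square brackets form a character list. A minimal sketch of the difference, assuming that exact whole-line matching is what is wanted; the function names `match_regex` and `match_literal` are made up for illustration:

```shell
# Regex match: the first argument is interpreted as a regular expression.
match_regex() {
    awk -v pat="$1" -v line="$2" \
        'BEGIN { print ((line ~ pat) ? "match" : "no match") }'
}

# Literal match: plain string comparison, no metacharacters involved.
match_literal() {
    awk -v pat="$1" -v line="$2" \
        'BEGIN { print ((line == pat) ? "match" : "no match") }'
}

match_regex   'Name (required)' 'Name (required)'   # no match: ( ) only group
match_literal 'Name (required)' 'Name (required)'   # match
```

Under that assumption, the corresponding change in the script quoted above would be to test `$0 == delete_list[bufpos + 1]` instead of `$0 ~ delete_list[bufpos + 1]`. The comparison is then exact, so leading and trailing whitespace in the delete file must agree with the input file.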

Dave Gibson

Sep 20, 2012, 9:40:04 AM
Steve Hayes <haye...@telkomsa.net> wrote:
> On Wed, 19 Sep 2012 12:56:13 +0100, dave.gma...@googlemail.com.invalid
> (Dave Gibson) wrote:
>
>>Steve Hayes <haye...@telkomsa.net> wrote:
>>> I'm not sure what that X-CC-Diagnostic thingy is. It seems big.
>>
>>The gibberish is the message body encoded as base64 -- it's not
>>associated with a specific header.
>
> I've just been checking some of the messages I've been trying to save.
>
> These ones are hard to read and save:
>
> Content-Type: text/plain; charset="utf-8"
> Content-Transfer-Encoding: base64
>
> These are not quite as hard to read or save, but still cause some
> problems:
>
> Content-Type: text/plain; charset=utf-8
> Content-Transfer-Encoding: quoted-printable
>
> These ones are easy to read and save:
>
> Content-Type: text/plain; charset="us-ascii"
> Content-Transfer-Encoding: 7bit
>
> The ones that are hardest to read and save appear to be produced
> by Windows Live Mail.

They're standard MIME encodings intended to prevent message data being
corrupted in transit.

<http://tools.ietf.org/html/rfc2045>

Your mail user agent should be able to convert them to local format
while saving them.

Have a look at these:

<http://www.convertstring.com/EncodeDecode/Base64Decode>
<http://www.convertstring.com/EncodeDecode/QuotedPrintableDecode>

>
> Perhaps one could tell awk to delete such messages.

Wouldn't you rather decode them?

<http://www.fourmilab.ch/webtools/base64/>

Anyway, the following assumes the messages are in a digest, separated by
lines containing the string "-- End --", and that none of them are
multipart messages. It drops every base64-encoded message and keeps the rest:

#v+
----script begins on next line
/^-- End --/ {
  if (!b64)
    print
  body = 0
  b64 = 0
  next
}

b64 { next }

!body && /^$/ {
  for (n = 1; n <= hlines; n++)
    print header[n]
  hlines = 0
  body = 1
}

body { print ; next }

/^[Cc][Oo][Nn][Tt][Ee][Nn][Tt]-[Tt][Rr][Aa][Nn][Ss][Ff][Ee][Rr]-[Ee][Nn][Cc][Oo][Dd][Ii][Nn][Gg]: [Bb][Aa][Ss][Ee]64/ {
  b64 = 1
  hlines = 0
  next
}

{ header[++hlines] = $0 }
#v-

> Would it also
> be able to convert "quoted printable" into something more readable?

Perl has modules for dealing with various mail formats so may well be
better suited to your requirements.

#v+
----script begins on next line
BEGIN {
  hex["0"] = 0 ; hex["1"] = 1 ; hex["2"] = 2 ; hex["3"] = 3
  hex["4"] = 4 ; hex["5"] = 5 ; hex["6"] = 6 ; hex["7"] = 7
  hex["8"] = 8 ; hex["9"] = 9 ; hex["A"] = 10 ; hex["B"] = 11
  hex["C"] = 12 ; hex["D"] = 13 ; hex["E"] = 14 ; hex["F"] = 15
  for (n = 0 ; n <= 255; n++)
    ch[n] = sprintf("%c", n)
}

/^-- End --/ { qp = 0 ; body = 0 }

/^$/ { body = 1 }

!body && /^[Cc][Oo][Nn][Tt][Ee][Nn][Tt]-[Tt][Rr][Aa][Nn][Ss][Ff][Ee][Rr]-[Ee][Nn][Cc][Oo][Dd][Ii][Nn][Gg]: [Qq][Uu][Oo][Tt][Ee][Dd]-[Pp][Rr][Ii][Nn][Tt][Aa][Bb][Ll][Ee]/ {
  qp = 1
  $NF = "8bit"
}

body && qp && /=/ {
  s = $0
  # Strip a soft line break: "=" followed by optional spaces or tabs (\t)
  u = sub(/=[ \t]*$/, "", s)
  t = ""
  while (match(s, /=[0-9A-F][0-9A-F]/)) {
    t = t substr(s, 1, RSTART - 1) \
        ch[hex[substr(s, RSTART + 1, 1)] * 16 + hex[substr(s, RSTART + 2, 1)]]
    s = substr(s, RSTART + RLENGTH)
  }
  $0 = t s
  if (u) {
    printf "%s", $0
    next
  }
}

{ print }
----script ends on previous line
#v-

Loki Harfagr

Sep 20, 2012, 10:05:12 AM
Wed, 19 Sep 2012 21:46:47 +0200, Steve Hayes did cat :

> On Wed, 19 Sep 2012 12:56:13 +0100, dave.gma...@googlemail.com.invalid
> (Dave Gibson) wrote:
>
>> # Insert the following line here
>>/^-- End --/ { print ; body = 0 ; next }
>
> Brilliant, thanks.
>
>>The gibberish is the message body encoded as base64 -- it's not associated
>>with a specific header.
>
> Ah, yes, with that addition I can see that.

If you want some more fun you may like and try to finish that game
I started some time in the past but stopped right when it served my
needs ATM instead of making it fully RFC compliant ;-)
That was a shell wrapper for some awk parts which would encode or
decode base64 stuff.
At least it might hopefully amuse some people here ;-)
------------
$ cat base64_in_awk.sh

### not yet complete nor compliant ;-)
###
### this is a simple base64 enc/dec in awk
### mostly made for fun but actually used in a few awk scripts I use
### in some other tools I wrote for fine grain analysis of
### texts, mainly emails, mainly spams to try and generate some
### synthetic regexps (or ideas of) regarding false positives or reinforcements.
### (and yes I know some tools exist in perl and I even use some of
### them which is another reason why I also do it in awk ;-)
### the wrapping is set at 72 like the 'mimencode' usage.
### (so, to avoid wrapping do set ORS to nil).
###
Aargh(){
    r=$1
    shift
    printf "\n%s\nThats all, folks...\n\n" "${@}"
    exit $r
}

[ $# -gt 1 ] || Aargh 1 "something is direly in the unseen world"
### most used way by default, anyway as 2 parms are mandatory this is
### only belting the suspenders
WOT=${1:-d}
shift
### gawk -v wot=$WOT -v ORS='' '
### gawk -v wot=$WOT -v ORS='µ' '
gawk -v wot=$WOT '
function _ba64dec(_b64str,_BASE64,_wrap,_res,_ba,_by,_len,_i,_j)
{
    _len=split(_b64str,_ba,"")
    while (_i<=_len){
        if( 0==(++_wrap) %72){++_i;continue}
        ### get the 4 _bytes values and find their position in BASE64 base
        for(_j=1;_j<5;_j++){
            _by[_j] = index(_BASE64, _ba[++_i])
            _by[_j]--
        }
        ### Reconstruct ASCII string
        _res = _res sprintf( "%c", lshift(and(_by[1], 63), 2) + rshift(and(_by[2], 48), 4) )
        _res = _res sprintf( "%c", lshift(and(_by[2], 15), 4) + rshift(and(_by[3], 60), 2) )
        _res = _res sprintf( "%c", lshift(and(_by[3], 3), 6) + _by[4] )
        gsub(/[\x00\xff\xbf\x0f]/,"",_res)
    }
    return _res
}
function _ord(_char, i)
{
    while(++i<256) if (sprintf("%c", i) == _char) return i
}

function _ba64enc(_b64str,_BASE64,_wrap, _ba1,_ba2,_ba3,_ba4,_by1,_by2,_by3,_by4, _res)
{
    while (length(_b64str) > 0){
        ### find the values
        _by1 = _ord(substr(_b64str, 1, 1))
        if (length(_b64str) == 1){
            _by2 = 0
            _by3 = 0
        }
        if (length(_b64str) == 2){
            _by2 = _ord(substr(_b64str, 2, 1))
            _by3 = 0
        }
        if (length(_b64str) >= 3){
            _by2 = _ord(substr(_b64str, 2, 1))
            _by3 = _ord(substr(_b64str, 3, 1))
        }

        ### transform to BASE64 values
        _ba1 = rshift(_by1, 2)
        _ba2 = lshift(and(_by1, 3), 4) + rshift(and(_by2, 240), 4)
        _ba3 = lshift(and(_by2, 15), 2) + rshift(and(_by3, 192), 6)
        _ba4 = and(_by3, 63)

        ### transmute values to BASE64 string
        _res = _res substr(_BASE64, _ba1 + 1, 1)
        _res = _res substr(_BASE64, _ba2 + 1, 1)
        if (length(_b64str) == 1){
            _res = _res "=="
            _b64str = ""
        }
        if (length(_b64str) == 2){
            _res = _res substr(_BASE64, _ba3 + 1, 1)
            _res = _res "="
            _b64str = ""
        }
        if (length(_b64str) >= 3){
            _res = _res substr(_BASE64, _ba3 + 1, 1)
            _res = _res substr(_BASE64, _ba4 + 1, 1)
            _b64str = substr(_b64str, 4)
        }
        if( 0==(++_wrap) %18) _res=_res ORS
    }
    return _res
}
BEGIN{_w=0}
{
    ### Base64 for filenames given as alternate example, see RFC4648
    ### _BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
    _BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    print wot=="d"?_ba64dec($0,_BASE64,_w):_ba64enc($0,_BASE64,_w)
}
' ${@}
------------

Aharon Robbins

Sep 20, 2012, 12:48:59 PM
Hello Steve.

The short answer is that indeed awk can do what you want. The trick in
processing mail headers is that continuation lines are marked by a leading
space or tab *after* the header they are part of. You should also take into
account that header names are case insensitive - "To:", "to:" and "TO:" are
all the same.
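
A minimal sketch of that unfolding, given only as an illustration. The
function name unfold_headers, the variable hdr and the choice to print
each header as "name: value" with a lower-cased name are assumptions
made for this example. It reads one message on stdin and stops at the
blank line that ends the headers.

```shell
# Unfold continuation lines and normalize header names to lower case.
unfold_headers() {
    awk '
    function emit(    i) {
        if (hdr == "") return
        i = index(hdr, ":")
        # Header names are case-insensitive, so fold the name part only.
        print tolower(substr(hdr, 1, i - 1)) substr(hdr, i)
        hdr = ""
    }
    /^$/     { emit(); exit }                               # end of headers
    /^[ \t]/ { sub(/^[ \t]+/, " "); hdr = hdr $0; next }    # continuation line
             { emit(); hdr = $0 }                           # a new header starts
    END      { emit() }
    '
}
```

Usage would be something like `unfold_headers < message.txt`, where
message.txt is a hypothetical file holding a single message.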

You can see a fairly fancy program at http://www.skeeve.com/sendout3.ps.gz
which I wrote a long time ago - part of it processes Unix mailbox files,
including the headers. (Note that it is quite old, and tailored just for
a personal situation. Also note that all of the example email addresses
in it are invalid.)

Converting quoted printable in awk is also not hard. Basically, the "="
sign either precedes a newline that was added, or is followed by two
hexadecimal digits indicating an encoded character.
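
A rough sketch of those two cases, as an illustration only. The names
qp_decode and hexval are made up for this example, and LC_ALL=C is set
on the assumption that you want "%c" to produce single bytes rather
than multibyte characters. It does not handle trailing whitespace or
malformed escapes.

```shell
# Decode quoted-printable text read from stdin.
qp_decode() {
    LC_ALL=C awk '
    function hexval(h,    digits) {      # two hex digits -> number
        digits = "0123456789ABCDEF"
        h = toupper(h)
        return (index(digits, substr(h, 1, 1)) - 1) * 16 + (index(digits, substr(h, 2, 1)) - 1)
    }
    {
        line = $0
        # A trailing "=" is a soft line break: drop it and join with the next line.
        soft = sub(/=$/, "", line)
        out = ""
        # Replace each "=XX" with the character it encodes.
        while (match(line, /=[0-9A-Fa-f][0-9A-Fa-f]/)) {
            out = out substr(line, 1, RSTART - 1) sprintf("%c", hexval(substr(line, RSTART + 1, 2)))
            line = substr(line, RSTART + 3)
        }
        line = out line
        if (soft)
            printf "%s", line
        else
            print line
    }
    '
}
```

The soft line break is tested before the "=XX" escapes are expanded, so
that a decoded "=3D" at the end of a line is not mistaken for one.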

You do not need to buy any awk book. The gawk documentation is available
on line, free of charge, in a variety of formats at

http://www.gnu.org/software/gawk/manual/

Although I'm biased, I think this is a great way to learn awk.

As a general plan of action, I recommend reading the gawk doc first,
in order to come up to speed on the language (hopefully in a gentle fashion :-)
and then attempting to write some code to do what you want. Once you have
that, if it doesn't work, come back here to ask questions.

I also recommend using the latest released version of gawk.

Good luck,

Arnold

In article <f97k58hdtpqnct0ih...@4ax.com>,
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Steve Hayes

Sep 20, 2012, 2:54:04 PM
On 20 Sep 2012 14:05:12 GMT, Loki Harfagr <l0...@thedarkdesign.free.fr.INVALID>
wrote:

>Wed, 19 Sep 2012 21:46:47 +0200, Steve Hayes did cat:
>
>> On Wed, 19 Sep 2012 12:56:13 +0100, dave.gma...@googlemail.com.invalid
>> (Dave Gibson) wrote:
>>
>>> # Insert the following line here
>>>/^-- End --/ { print ; body = 0 ; next }
>>
>> Brilliant, thanks.
>>
>>>The gibberish is the message body encoded as base64 -- it's not associated
>>>with a specific header.
>>
>> Ah, yes, with that addition I can see that.
>
>If you want some more fun you may like and try to finish that game
>I started some time in the past but stopped right when it served my
>needs ATM instead of making it fully RFC compliant ;-)
>That was a shell wrapper for some awk parts which would encode or
>decode base64 stuff.

Wow, and I was just thinking of playing with something that might discard the
whole message if it had base64 stuff.

But when I've learnt a bit more of the basics I might try it.

Steve Hayes

Sep 20, 2012, 3:00:12 PM
On Thu, 20 Sep 2012 16:48:59 +0000 (UTC), arn...@skeeve.com (Aharon Robbins)
wrote:

>Hello Steve.
>
>The short answer is that indeed awk can do what you want. The trick in
>processing mail headers is that continuation lines are marked by a leading
>space or tab *after* the header they are part of. You should also take into
>account that header names are case insensitive - "To:", "to:" and "TO:" are
>all the same.
>
>You can see a fairly fancy program at http://www.skeeve.com/sendout3.ps.gz
>which I wrote a long time ago - part of it processes Unix mailbox files,
>including the headers. (Note that it is quite old, and tailored just for
>a personal situation. Also note that all of the example email addresses
>in it are invalid.)
>
>Converting quoted printable in awk is also not hard. Basically, the "="
>sign either precedes a newline that was added, or is followed by two
>hexadecimal digits indicating an encoded character.

It seems to be capable of doing a lot more than I imagined it could.

>You do not need to buy any awk book. The gawk documentation is available
>on line, free of charge, in a variety of formats at
>
> http://www.gnu.org/software/gawk/manual/
>
>Although I'm biased, I think this a great way to learn awk.

I've got a book on Linux (actually a library book), which has a chapter on
gawk, and I've been re-reading it now that I've seen some samples of code and
what it does. But it's just a bare-bones summary.

>
>As a general plan of action, I recommend reading the gawk doc first,
>in order to come up to speed on the language (hopefully in a gentle fashion :-)
>and then attempting to write some code to do what you want. Once you have
>that, if it doesn't work, come back here to ask questions.
>
>I also recommend using the latest released version of gawk.

I've probably got that in my Linux partition, but most of the things I want to
use it for are in DOS.

Aharon Robbins

Sep 20, 2012, 3:05:54 PM
>>You can see a fairly fancy program at http://www.skeeve.com/sendout3.ps.gz
>>...
>
>It seems to be capable of doing a lot more than I imagined it could.

Yes. :-)

>I've got a book on Linux (actually a library book), which has a chapter on
>gawk, and I've been re-reading it now that I've seen some samples of code and
>what it does. But it's just a bare-bones summary.

Invest some time in the gawk doc. I think it will return your investment.

>>I also recommend using the latest released version of gawk.
>
>I've probably got that in my Linux partition, but most of the things I want to
>use it for are in DOS.

See http://sourceforge.net/projects/ezwinports/ for MS-Windows binaries that
will run from a DOS prompt.

If you mean honest-to-goodness actual MS-DOS, then getting a version for
it will be harder. I believe that current sources can be compiled with
DJGPP, but I don't know if that will get you what you want.

You can probably find something on the Internet, but it's likely to be
an older version, and often such versions have bugs... So, Caveat Emptor. :-)

Good luck,

Arnold

Dave Gibson

Sep 20, 2012, 5:02:27 PM
[ Followup-To: set to comp.lang.awk ]

In comp.lang.awk, no.to...@gmail.com wrote:
> In article <pspci9x...@perseus.wenlock-data.co.uk>,
> dave.gma...@googlemail.com.invalid (Dave Gibson) wrote:
>
>> In comp.lang.awk, no.to...@gmail.com wrote:

>> awk -f the_following_script FileDelete FileIn > result
>>
>> ----script begins on next line
>> #! /usr/bin/awk -f
>>
>> function flush_buffer(discard, n) {
>> if (!discard)
>> for (n = 1; n <= bufpos; n++)
>> print buffer[n]
>> bufpos = 0 \\ local var
>> }

bufpos is global, discard and n are only visible within that function.
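
awk has no declaration keyword for this. Any variable named in a
function's parameter list is local to that function, and every other
variable is global. Extra parameters that the caller never passes (n
in flush_buffer) are the usual way to get local variables. A small
sketch of the rule, using made-up names (scope_demo, bump, step, tmp,
counter) that are not part of the script:

```shell
scope_demo() {
    awk '
    function bump(step,    tmp) {   # step: real parameter; tmp: extra parameter used as a local
        tmp = counter + step        # tmp vanishes when the function returns
        counter = tmp               # counter is not in the parameter list, so it is global
    }
    BEGIN {
        bump(2); bump(3)
        print "counter=" counter    # prints counter=5
        print "tmp=[" tmp "]"       # prints tmp=[] since the main program never sees tmp
    }'
}
```

The extra spaces before tmp in the parameter list are only a
convention to separate real parameters from locals.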

>>
>> NR == FNR {
>> delete_list[++delmax] = $0 \\ 1stReadFile -> delete_list: ARRAY
>> next
>> }
>> \\ AFTER 1stReadFile DONE & 2ndReadFile
>> $0 ~ delete_list[bufpos + 1] { \\ IF CurrentLine ~

The ~ is awk's match operator.

Maybe think of it as:

IF RegexCompare(CurrentLine, delete_list[bufpos + 1]) THEN

>> buffer[++bufpos] = $0
>> if (bufpos >= delmax)
>> flush_buffer(blocks_seen++)
>> next
>> }
>>
>> bufpos {
>> flush_buffer(0)
>> }

That's a bug. It's necessary to check whether $0 matches delete_list[1]
(and restart buffering if it does) after flushing the buffer.

>>
>> { print }
>>
>> END {
>> flush_buffer(0)
>> }
>> ----script ends on previous line

> The test conclusions, so far, are that:
> if chars "(", "]", "[" are in the DeleteFile: H,
> this gives problems.

They are regular expression metacharacters with special meaning to
awk's match operator.
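
If the delete file holds literal text rather than patterns, one way to
avoid the problem is to not use the match operator at all. A minimal
sketch, assuming a hypothetical delete-file line and input line
(literal_demo, pat and line are names invented for this example):

```shell
literal_demo() {
    awk '
    BEGIN {
        pat  = "[1] (see"               # hypothetical line from the delete file
        line = "text [1] (see below)"   # hypothetical line from the input file
        # index() is a plain substring search, so "[", "]" and "(" are ordinary characters.
        print (index(line, pat) > 0 ? "literal: match" : "literal: no match")
        # Whole-line equality needs no regex either.
        print (line == pat ? "exact: match" : "exact: no match")
    }'
}
```

The same substitution (index() or == in place of ~) could be made in
the script's comparison against delete_list, at the cost of losing
regular expressions in the delete file.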

Here's the fixed version of the script:

----script begins on next line
#! /usr/bin/awk -f

function flush_buffer(discard,    n) {
    if (!discard)
        for (n = 1; n <= bufpos; n++)
            print buffer[n]
    bufpos = 0
}

function try_seq(s) {
    if (s ~ delete_list[bufpos + 1]) {
        buffer[++bufpos] = s
        if (bufpos >= delmax)
            flush_buffer(blocks_seen++)
        return 1
    }
    return 0
}

NR == FNR {
    delete_list[++delmax] = $0
    next
}

try_seq($0) { next }

bufpos {
    flush_buffer(0)
    if (try_seq($0))
        next
}

{ print }

END {
    flush_buffer(0)
}
----script ends on previous line

The script works by loading the first file's contents into an
array. The array is a sequence of regular expressions.

The second file is scanned for sequences of lines in which each
line matches the corresponding entry in the array of regular
expressions.

When a complete sequence of matches is made the matched lines are
discarded if they are not the first occurrence of a match-sequence.

Input file 1 (FileDelete2) contains three lines:
a
b
[cz]

Input file 2 (FileIn2) contains 17 lines:
1 : nothing
2 : a MATCHES THE FIRST PATTERN IN THE SEQUENCE
3 : b MATCHES THE SECOND PATTERN IN THE SEQUENCE
4 : k
5 : a 2 SEQUENCE BEGINS
6 : b 2
7 : c 1 FULL SEQUENCE MATCHES (FIRST TIME: 5,6,7 PRINTED)
8 : a 3 SEQUENCE BEGINS
9 : a 4 SEQUENCE FAILS, LINE 8 PRINTED, NEW SEQUENCE BEGINS
10: b 3
11: c 2 FULL SEQUENCE MATCHES (NOT FIRST TIME: 9,10,11 OMITTED)
12: NO MATCH
13: c 3 OUT OF SEQUENCE, NO MATCH
14: a 5 FIRST IN SEQUENCE
15: b 4 SECOND IN SEQUENCE
16: z 1 THIRD IN SEQUENCE (14, 15, 16 DROPPED)
17: example SEQUENCE BEGINS, FAILS DUE TO END-OF-INPUT, 17 PRINTED

The command

awk -f the_above_script FileDelete2 FileIn2

will print lines 1 to 8, 12, 13 and 17.

Manuel Collado

Sep 21, 2012, 4:47:56 AM
On 20/09/2012 18:48, Aharon Robbins wrote:
> ...
> You can see a fairly fancy program at http://www.skeeve.com/sendout3.ps.gz
> which I wrote a long time ago ...

It seems that this file is a woven noweb Literate Programming document.
Is the noweb source code also available? I'm still interested in LP,
and there are very few real LP examples on the net.

Thanks,
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado



f...@informatik.uni-bremen.de

Sep 28, 2012, 3:23:46 PM
In article <3scsi9x...@perseus.wenlock-data.co.uk>, dave.gma...@googlemail.com.invalid (Dave Gibson) wrote:

Someone contaminated my thread.

Let's move this to:
Newsgroups: comp.lang.awk,comp.os.linux.misc
Subject: awk: DeleteRepeatingTextBlocks

Let's try to add-value for the *nix community by revealing
methods which others can modify and use for their problems.

===========
> >> print buffer[n]
> >> bufpos = 0 \\ local var
> >> }
>
> bufpos is global, discard and n are only visible within that function.

-> man awk | grep bufpos == <empty>
So 'bufpos' is not a reserved-word [mentioned in man].
So how does awk's syntax make it global, whereas
'discard' and 'n' are local? I see 'bufpos' further on in the code.
=================
> The script works by loading the first file's contents into an
> array. The array is a sequence of regular expressions.

"loading the first file's contents into an array."
is an action intended to help achieve a HIGHER goal.
It's better to state the higher goal FIRST.

It's called top-down-design.
The implementation details, which must be bottom-up,
are best not explained until the top-down-design is
known. Here's my STRUCTURED view:

SpeedUp knowledge absorption from http-fetched text
  Delete noise/distracting repeated garbage
    Identify garbage
      Must be done by human intelligence
        use a standard editor -- while reading/studying the text
    Automate the removal of further garbage-repeats
      Search the InFile for further matches of human-identified-trash

==> PS. I'm writing this WHILE I'm trying to decode your explanation.
The decomposition-chain is: Delete needs Match needs Regex.

You are going to compare the DeleteFile with the InFile.
To handle the regex requirement, you are building
"an array of regular expressions".
Apparently to <match the array with InFile parts> ?

My test results for your new script are dumb, since I have no
intermediate output traces.
See Subject: awk: DeleteRepeatingTextBlocks

Thanks,

== Chris Glur.


Kaz Kylheku

Sep 28, 2012, 7:14:46 PM
On 2012-09-28, f...@informatik.uni-bremen.de <f...@informatik.uni-bremen.de> wrote:
> Identify garbage

Easy: pretty much everything posted by the incompetent originator of this
retarded thread.

Keith Keller

Sep 28, 2012, 10:12:52 PM
["Followup-To:" header set to comp.os.linux.misc.]
> Let's try to add-value for the *nix community by revealing
> methods which others can modify and use for their problems.

I think the best way to add value for the *nix community is for you to
stop asking these poorly phrased questions, and for the rest of us to
stop answering them.

--keith

--
kkeller...@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information

Chick Tower

Sep 29, 2012, 3:41:41 PM
That was him using another pseudonym.
--
Chick Tower

For e-mail: colm DOT sent DOT towerboy AT xoxy DOT net