On Wed, 19 Sep 2012 12:56:13 +0100, dave.gma+news...@googlemail.com.invalid
(Dave Gibson) wrote:
> # Insert the following line here
>/^-- End --/ { print ; body = 0 ; next }
Brilliant, thanks.
>The gibberish is the message body encoded as base64 -- it's not associated
>with a specific header.
Ah, yes, with that addition I can see that.
-- Steve Hayes from Tshwane, South Africa
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk
Steve Hayes <hayes...@telkomsa.net> wrote:
> On Tue, 18 Sep 2012 18:28:10 GMT, "Ed Morton" <mortons...@gmail.com> wrote:
> >Steve Hayes <hayes...@telkomsa.net> wrote:
> ><snip>
> >> When I tried it, I got this message:
> >> awk line 26:function not defined tolower
> >What OS are you running on? Also, please copy/paste the result of
> >running "awk --version" so we know for sure what awk you have. Whatever
> >it is, it's not GNU awk so I'd highly recommend you install that and use
> >it from now on. It's available at http://www.gnu.org/software/gawk/.
> It's ancient -- Version 3.0 for DOS, but when I ran that I got something very
> strange:
> DOSPRN Print Spooler. Version 1.77
> (c) 1990-2004 by Gurtjak D., Ignatenko I., Goldberg A.
> Use extended memory: 200K
> Use conventional memory: 4K
Strange indeed! Can you install gawk? If tolower() and reasonable support for
"--version" are missing there's no telling what else might be less than ideal
about that awk version and gawk provides a lot of VERY useful additional
functionality.
The ones that are hardest to read and save appear to be produced by Windows
Live Mail.
Perhaps one could tell awk to delete such messages. Would it also be able to
convert "quoted printable" into something more readable?
-- Steve Hayes from Tshwane, South Africa
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk
Steve Hayes wrote:
> On Wed, 19 Sep 2012 17:11:08 GMT, unruh <un...@invalid.ca> wrote:
>> Capturing the full lines of the headers even if they stretched over more
>> than one line is more difficult and I am sure not going to spend time
>> thinking about it since the OP never said why he wanted this, or whether
>> it was more than simply a passing curiosity to him. You are welcome to
>> do it if you care to.
> I've been wanting to do something like this for 20 years, and when I saw AWK
> and its description I thought it might be able to do something like this, but
> I didn't see how.
> When someone asked if awk could perform a somewhat similar task, and there
> appeared to be some awk fundis who knew how to make the thing work, I then
> asked if it could do what I wanted it to do - in other words remove extraneous
> headers from saved e-mail messages, which would make it easier to import them
> into a database.
> As I said elsewhere, in spite of having a version of awk lurking on my
> computer for 20 years or so, I've never known how to use it, and I'm a
> complete novice, but I hope to learn something from those who do know how to
> use it.
I wrote some stuff in awk once. I've even got the manual. It took me longer to get it working than the replacement which I wrote in C....and ran a lot slower.
-- Ineptocracy
(in-ep-toc’-ra-cy) – a system of government where the least capable to lead are elected by the least capable of producing, and where the members of society least likely to sustain themselves or succeed, are rewarded with goods and services paid for by the confiscated wealth of a diminishing number of producers.
>>Capturing the full lines of the headers even if they stretched over more
>>than one line is more difficult and I am sure not going to spend time
>>thinking about it since the OP never said why he wanted this, or whether
>>it was more than simply a passing curiosity to him. You are welcome to
>>do it if you care to.
> I've been wanting to do something like this for 20 years, and when I saw AWK
> and its description I thought it might be able to do something like this, but
> I didn't see how.
Well, now you have seen how. But the chance that someone is going to
write the program for you is small. The headers are not the space
problem in saving emails. The body is. Even a very large header is going
to be less than 1K, where the body these days is more like 1M or more.
So it is a pretty silly task (straining at the gnats while ignoring the
camel) to reduce the headers.
> When someone asked if awk could perform a somewhat similar task, and there
> appeared to be some awk fundis who knew how to make the thing work, I then
> asked if it could do what I wanted it to do - in other words remove extraneous
> headers from saved e-mail messages, which would make it easier to import them
> into a database.
> As I said elsewhere, in spite of having a version of awk lurking on my
> computer for 20 years or so, I've never known how to use it, and I'm a
> complete novice, but I hope to learn something from those who do know how to
> use it.
On Thu, 20 Sep 2012 00:07:13 GMT, unruh <un...@invalid.ca> wrote:
>On 2012-09-19, Steve Hayes <hayes...@telkomsa.net> wrote:
>> On Wed, 19 Sep 2012 17:11:08 GMT, unruh <un...@invalid.ca> wrote:
>Well, now you have seen how. But the chance that someone is going to
>write the program for you is small. The headers are not the space
>problem in saving emails. The body is. Even a very large header is going
>to be less than 1K, where the body these days is more like 1M or more.
>So it is a pretty silly task (straining at the gnats while ignoring the
>camel) to reduce the headers.
It's not a space problem, it's a readability problem. Ten years after a
message has been sent, the routing information etc will be of little interest.
>> When someone asked if awk could perform a somewhat similar task, and there
>> appeared to be some awk fundis who knew how to make the thing work, I then
>> asked if it could do what I wanted it to do - in other words remove extraneous
>> headers from saved e-mail messages, which would make it easier to import them
>> into a database.
>> As I said elsewhere, in spite of having a version of awk lurking on my
>> computer for 20 years or so, I've never known how to use it, and I'm a
>> complete novice, but I hope to learn something from those who do know how to
>> use it.
>Buy the Awk book and read it.
That's quite an expensive exercise if it turns out that awk is, after all, not
suitable for the task.
I'm glad that not all awk users are as rude and unhelpful as you.
[follow ups set]
-- Steve Hayes from Tshwane, South Africa
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk
> awk -f the_following_script FileDelete FileIn > result
> ----script begins on next line
> #! /usr/bin/awk -f
> function flush_buffer(discard, n) {
> if (!discard)
> for (n = 1; n <= bufpos; n++)
> print buffer[n]
> bufpos = 0 # local var
> }
> NR == FNR {
> delete_list[++delmax] = $0 # 1stReadFile -> delete_list: ARRAY
> next
> }
> # AFTER 1stReadFile DONE & 2ndReadFile
> $0 ~ delete_list[bufpos + 1] { # IF CurrentLine ~
> buffer[++bufpos] = $0
> if (bufpos >= delmax)
> flush_buffer(blocks_seen++)
> next
> }
> bufpos {
> flush_buffer(0)
> }
> { print }
> END {
> flush_buffer(0)
> }
> ----script ends on previous line
This non-trivial task is part of a family-of-tasks that
we need to do all the time: clean out redundant/repeated stuff.
I've previously got some very useful scripts from USEnet collaborations,
like: list all files in dir-tree $1
which are less than N-days old,
and which contain String1,
...
and which contain StringN
I use those scripts every day, and THIS one will be valuable too.
So, I've made a test-script, to help beta-test the versions.
The task is to delete repeated/redundant blocks of lines
[although as Ed Morton pointed out, awk is not limited to lines]
from text files.
Input files have the format of: Ha|Hb|Hc|Hd|...Hn|
where H is the repeated/redundant block of text,
and a,b...n is the valuable text, to be kept,
and | represents a one-line-section-separator:
typically "<><><><>"
My test script which assembles a,d,..H blocks into the Infile,
by using the human-edited H block, gave the following results,
which should probably be ignored, for difficulty of understanding.
The test conclusions, so far, are that:
if chars "(", "]", "[" are in the DeleteFile: H,
this gives problems.
Are these special-chars for `bash` ?
Thanks,
== Chris Glur.
Copy existing files for: a, b, c, d, e
use simple 1-char: H
-> ./BuildI == construct FileIn from parts:
len-I = 713
-> cp H R
-> TstDG ==
len Infile = 713
len DeleteFile = 1
len ouTfile = 687
==> 713 - 687 == 26 <-- expect 713 - 4 == 709
===> perhaps extra 'H' files were found.
====> test with unusual type of 'H' file
-> echo qzxv >> H
-> ./BuildI == construct FileIn from parts:
len-I = 718
-> cp H R
-> TstDG ==
len Infile = 718
len DeleteFile = 2
len ouTfile = 710
==> 718 - 710 == 8 == 2*4 == OK
====> now use a big random: H
-> ./BuildI == construct FileIn from parts:
len-I = 1538
-> cp H R
-> ./TstDG ==
len Infile = 1538
len DeleteFile = 166
len ouTfile = 1538
==> suspect <special chars> in R
====> combine
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 1538
len Infile = 1538
len DeleteFile = 166
len ouTfile = 1538
==> as expected/confirmed
==> keep problematic 'H' as Horg & edit out suspected line/s
==> Let 'H'contain NO-square-brackets.
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 763
len Infile = 763
len DeleteFile = 11
len ouTfile = 708
==> 763 - 708 == 55 == 5*11
===> suspect that now-reduced: H appears 6-times in FileIn
==> Yes, with difficulty, an editor confirms: H appears 6-times
-> cp H Horg2 => edit/select 'H' to contain line:" [44]Gravatar"
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 838
len Infile = 838
len DeleteFile = 26
len ouTfile = 838
=> confirm problem with chars: "[","]"
=?=> which one or both: == deleting BOTH still FAILS
=> iteratively delete-1st-line of H until notFAIL
--------------- one line deleted between tests ---------------
construct FileIn from parts:
len-I = 793
len Infile = 793
len DeleteFile = 17
len ouTfile = 793
bash-3.1# ./BuildNtest
construct FileIn from parts:
len-I = 783
len Infile = 783
len DeleteFile = 15
len ouTfile = 723
---------------------------------------------------------
=?=!=> the line that caused the FAIL:-
Name (required)
==> adding an un-matched "(" 10 lines before end of 'H' causes:
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 783
awk: DGscript:15: (FILENAME=I FNR=7) fatal: Unmatched ( or \(: / * (/
len Infile = 783
len DeleteFile = 15
len ouTfile = 0
==> remove "(" & test "]"
-> ./BuildNtest ==
construct FileIn from parts:
len-I = 783
len Infile = 783
len DeleteFile = 15
len ouTfile = 783
=> replace with "[]"
-> DGscript ==
:15: (FILENAME=I FNR=7) fatal: Unmatched [ or [^: / * []/
len Infile = 783
len DeleteFile = 15
len ouTfile = 0
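# Assumed setup, not shown in the original fragment: hex[] maps a hex digit
# to its value and ch[] maps a byte value to the corresponding character.
BEGIN {
    for (i = 0; i < 16; i++)
        hex[substr("0123456789ABCDEF", i + 1, 1)] = i
    for (i = 0; i < 256; i++)
        ch[i] = sprintf("%c", i)
}

# Decode quoted-printable body lines that contain "=".
body && qp && /=/ {
    s = $0
    # The bracket expression on the next line matches a space or a tab
    u = sub(/=[ \t]*$/, "", s)    # u is 1 for a soft line break (trailing "=")
    t = ""
    while (match(s, /=[0-9A-F][0-9A-F]/)) {
        t = t substr(s, 1, RSTART - 1) \
            ch[hex[substr(s, RSTART + 1, 1)] * 16 + hex[substr(s, RSTART + 2, 1)]]
        s = substr(s, RSTART + RLENGTH)
    }
    $0 = t s
    if (u) {
        printf "%s", $0    # soft break: join with the next line
        next
    }
}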
>> # Insert the following line here
>>/^-- End --/ { print ; body = 0 ; next }
> Brilliant, thanks.
>>The gibberish is the message body encoded as base64 -- it's not associated
>>with a specific header.
> Ah, yes, with that addition I can see that.
If you want some more fun you may like and try to finish that game
I started some time in the past but stopped right when it served my
needs ATM instead of making it fully RFC compliant ;-)
That was a shell wrapper for some awk parts which would encode or
decode base64 stuff.
At least it might hopefully amuse some people here ;-)
------------
$ cat base64_in_awk.sh
### not yet complete nor compliant ;-)
###
### this is a simple base64 enc/dec in awk
### mostly made for fun but actually used in a few awk scripts I use
### in some other tools I wrote for fine grain analysis of
### texts, mainly emails, mainly spams to try and generate some
### synthetic regexps (or ideas of) regarding false positives or reinforcements.
### (and yes I know some tools exist in perl and I even use some of
### them which is another reason why I also do it in awk ;-)
### the wrapping is set at 72 like the 'mimencode' usage.
### (so, to avoid wrapping do set ORS to nil).
###
Aargh(){
r=$1
shift
printf "\n%s\nThats all, folks...\n\n" "${@}"
exit $r
}
[ $# -gt 1 ] || Aargh 1 "something is direly in the unseen world"
### most used way by default, anyway as 2 parms are mandatory this is
### only belting the suspenders
WOT=${1:-d}
shift
### gawk -v wot=$WOT -v ORS='' '
### gawk -v wot=$WOT -v ORS='µ' '
gawk -v wot=$WOT '
function _ba64dec(_b64str,_BASE64,_wrap,_res,_ba,_by,_len,_i,_j)
{
_len=split(_b64str,_ba,"")
while (_i<=_len){
if( 0==(++_wrap) %72){++_i;continue}
### get the 4 _bytes values and find their position in BASE64 base
for(_j=1;_j<5;_j++){
_by[_j] = index(_BASE64, _ba[++_i])
_by[_j]--
}
### Reconstruct ASCII string
_res = _res sprintf( "%c", lshift(and(_by[1], 63), 2) + rshift(and(_by[2], 48), 4) )
_res = _res sprintf( "%c", lshift(and(_by[2], 15), 4) + rshift(and(_by[3], 60), 2) )
_res = _res sprintf( "%c", lshift(and(_by[3], 3), 6) + _by[4] )
gsub(/[\x00\xff\xbf\x0f]/,"",_res)
}
return _res
}
function _ord(_char, i)
{
while(++i<256) if (sprintf("%c", i) == _char) return i
}
function _ba64enc(_b64str,_BASE64,_wrap, _ba1,_ba2,_ba3,_ba4,_by1,_by2,_by3,_by4, _res)
{
while (length(_b64str) > 0){
### find the values
_by1 = _ord(substr(_b64str, 1, 1))
if (length(_b64str) == 1){
_by2 = 0
_by3 = 0
}
if (length(_b64str) == 2){
_by2 = _ord(substr(_b64str, 2, 1))
_by3 = 0
}
if (length(_b64str) >= 3){
_by2 = _ord(substr(_b64str, 2, 1))
_by3 = _ord(substr(_b64str, 3, 1))
}
BEGIN{_w=0}
{ ### Base64 for filenames given as alternate example, see RFC4648
### _BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
_BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
print wot=="d"?_ba64dec($0,_BASE64,_w):_ba64enc($0,_BASE64,_w)
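As a separate illustration (not a reconstruction of the listing above), here is a minimal base64 decoder sketch in portable awk. It assumes the standard "+/" alphabet, ignores line wrapping, and uses plain arithmetic instead of gawk's and(), lshift() and rshift(). The function name b64decode is made up for this example.

```shell
#!/bin/sh
# Sketch only: decode base64 text from standard input with portable awk.
# b64decode is a hypothetical name; output is only reliable for ASCII,
# since some awks in UTF-8 locales print values above 127 as multibyte.
b64decode() {
    awk '
    BEGIN { B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/" }
    {
        gsub(/[^A-Za-z0-9+\/=]/, "")      # drop anything outside the alphabet
        for (i = 1; i + 3 <= length($0); i += 4) {
            pad = 0; n = 0
            for (j = 0; j < 4; j++) {     # pack four 6-bit values into n
                c = substr($0, i + j, 1)
                if (c == "=") { pad++; v = 0 } else v = index(B64, c) - 1
                n = n * 64 + v
            }
            out = sprintf("%c", int(n / 65536))
            if (pad < 2) out = out sprintf("%c", int(n / 256) % 256)
            if (pad < 1) out = out sprintf("%c", n % 256)
            printf "%s", out
        }
    }'
}
```

For example, `echo 'SGVsbG8sIGF3ayE=' | b64decode` should print `Hello, awk!` without a trailing newline.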
The short answer is that indeed awk can do what you want. The trick in
processing mail headers is that continuation lines are marked by a leading
space or tab *after* the header they are part of. You should also take into
account that header names are case insensitive - "To:", "to:" and "TO:" are
all the same.
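A minimal sketch of those two points, assuming a single saved message on standard input and a made-up keep-list of From, To, Date and Subject (neither assumption comes from this thread). It joins continuation lines onto the header they belong to, compares header names case-insensitively, and prints the body untouched. It relies on tolower(), so it needs gawk or another POSIX awk.

```shell
#!/bin/sh
# Sketch only: keep a few headers of one saved message and drop the rest.
# strip_headers and the keep-list are assumptions for illustration.
strip_headers() {
    awk '
    function emit(   name) {              # name is a local variable
        if (hdr == "") return
        name = tolower(substr(hdr, 1, index(hdr, ":") - 1))
        if (name in keep) print hdr
        hdr = ""
    }
    BEGIN {
        keep["from"]; keep["to"]; keep["date"]; keep["subject"]
        in_header = 1
    }
    in_header && /^$/      { emit(); in_header = 0; print; next }  # end of headers
    in_header && /^[ \t]/  { sub(/^[ \t]+/, " "); hdr = hdr $0; next }  # continuation
    in_header              { emit(); hdr = $0; next }              # new header
                           { print }                               # body
    END                    { if (in_header) emit() }
    '
}
```

Usage would be something like `strip_headers < saved_message.txt > cleaned.txt`, where both file names are placeholders.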
You can see a fairly fancy program at http://www.skeeve.com/sendout3.ps.gz
which I wrote a long time ago - part of it processes Unix mailbox files,
including the headers. (Note that it is quite old, and tailored just for
a personal situation. Also note that all of the example email addresses
in it are invalid.)
Converting quoted printable in awk is also not hard. Basically, the "="
sign either precedes a newline that was added, or is followed by two
hexadecimal digits indicating an encoded character.
You do not need to buy any awk book. The gawk documentation is available
on line, free of charge, in a variety of formats at
Although I'm biased, I think this is a great way to learn awk.
As a general plan of action, I recommend reading the gawk doc first,
in order to come up to speed on the language (hopefully in a gentle fashion :-)
and then attempting to write some code to do what you want. Once you have
that, if it doesn't work, come back here to ask questions.
I also recommend using the latest released version of gawk.
Good luck,
Arnold
In article <f97k58hdtpqnct0ihqvhmobn15qb54s...@4ax.com>,
Steve Hayes <hayes...@yahoo.com> wrote:
>>Capturing the full lines of the headers even if they stretched over more
>>than one line is more difficult and I am sure not going to spend time
>>thinking about it since the OP never said why he wanted this, or whether
>>it was more than simply a passing curiosity to him. You are welcome to
>>do it if you care to.
>I've been wanting to do something like this for 20 years, and when I saw AWK
>and its description I thought it might be able to do something like this, but
>I didn't see how.
>When someone asked if awk could perform a somewhat similar task, and there
>appeared to be some awk fundis who knew how to make the thing work, I then
>asked if it could do what I wanted it to do - in other words remove extraneous
>headers from saved e-mail messages, which would make it easier to import them
>into a database.
>As I said elsewhere, in spite of having a version of awk lurking on my
>computer for 20 years or so, I've never known how to use it, and I'm a
>complete novice, but I hope to learn something from those who do know how to
>use it.
>--
>Steve Hayes from Tshwane, South Africa
>Blog: http://khanya.wordpress.com
>E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk
-- Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
>>> # Insert the following line here
>>>/^-- End --/ { print ; body = 0 ; next }
>> Brilliant, thanks.
>>>The gibberish is the message body encoded as base64 -- it's not associated
>>>with a specific header.
>> Ah, yes, with that addition I can see that.
>If you want some more fun you may like and try to finish that game
>I started some time in the past but stopped right when it served my
>needs ATM instead of making it fully RFC compliant ;-)
>That was a shell wrapper for some awk parts which would encode or
>decode base64 stuff.
Wow, and I was just thinking of playing with something that might discard the
whole message if it had base64 stuff.
But when I've learnt a bit more of the basics I might try it.
-- Steve Hayes from Tshwane, South Africa
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk
>The short answer is that indeed awk can do what you want. The trick in
>processing mail headers is that continuation lines are marked by a leading
>space or tab *after* the header they are part of. You should also take into
>account that header names are case insensitive - "To:", "to:" and "TO:" are
>all the same.
>You can see a fairly fancy program at http://www.skeeve.com/sendout3.ps.gz
>which I wrote a long time ago - part of it processes Unix mailbox files,
>including the headers. (Note that it is quite old, and tailored just for
>a personal situation. Also note that all of the example email addresses
>in it are invalid.)
>Converting quoted printable in awk is also not hard. Basically, the "="
>sign either precedes a newline that was added, or is followed by two
>hexadecimal digits indicating an encoded character.
It seems to be capable of doing a lot more than I imagined it could.
>You do not need to buy any awk book. The gawk documentation is available
>on line, free of charge, in a variety of formats at
>Although I'm biased, I think this is a great way to learn awk.
I've got a book on Linux (actually a library book), which has a chapter on
gawk, and I've been re-reading it now that I've seen some samples of code and
what it does. But it's just a bare-bones summary.
>As a general plan of action, I recommend reading the gawk doc first,
>in order to come up to speed on the language (hopefully in a gentle fashion :-)
>and then attempting to write some code to do what you want. Once you have
>that, if it doesn't work, come back here to ask questions.
>I also recommend using the latest released version of gawk.
I've probably got that in my Linux partition, but most of the things I want to
use it for are in DOS.
-- Steve Hayes from Tshwane, South Africa
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk
>It seems to be capable of doing a lot more than I imagined it could.
Yes. :-)
>I've got a book on Linux (actually a library book), which has a chapter on
>gawk, and I've been re-reading it now that I've seen some samples of code and
>what it does. But it's just a bare-bones summary.
Invest some time in the gawk doc. I think it will return your investment.
>>I also recommend using the latest released version of gawk.
>I've probably got that in my Linux partition, but most of the things I want to
>use it for are in DOS.
If you mean honest-to-goodness actual MS-DOS, then getting a version for
it will be harder. I believe that current sources can be compiled with
DJGPP, but I don't know if that will get you what you want.
You can probably find something on the Internet, but it's likely to be
an older version, and often such versions have bugs... So, Caveat Emptor. :-)
Good luck,
Arnold
-- Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
In comp.lang.awk, no.top.p...@gmail.com wrote:
> In article <pspci9xcs2....@perseus.wenlock-data.co.uk>,
> dave.gma+news...@googlemail.com.invalid (Dave Gibson) wrote:
>> In comp.lang.awk, no.top.p...@gmail.com wrote:
>> awk -f the_following_script FileDelete FileIn > result
>> ----script begins on next line
>> #! /usr/bin/awk -f
>> function flush_buffer(discard, n) {
>> if (!discard)
>> for (n = 1; n <= bufpos; n++)
>> print buffer[n]
>> bufpos = 0 # local var
>> }
bufpos is global, discard and n are only visible within that function.
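A small sketch of that scoping rule, using made-up names (scope_demo, demo, counter and tmp are not from the script): in awk, only the names listed in a function's parameter list are local to it, and every other variable is global. Extra parameters that callers never pass are the usual idiom for declaring locals.

```shell
#!/bin/sh
# Sketch: awk scoping. Only function parameters are local; all else is global.
scope_demo() {
    awk '
    function demo(arg, tmp) {   # tmp: extra parameter used as a local variable
        tmp = arg * 2           # local: gone when demo() returns
        counter++               # not in the parameter list, so global
        return tmp
    }
    BEGIN {
        tmp = "untouched"       # a global that happens to share the name
        demo(1); demo(2)
        print "counter=" counter, "tmp=" tmp
    }'
}
scope_demo   # prints: counter=2 tmp=untouched
```

In the quoted script, discard and n play the role of arg and tmp, while bufpos behaves like counter.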
>> NR == FNR {
>> delete_list[++delmax] = $0 # 1stReadFile -> delete_list: ARRAY
>> next
>> }
>> # AFTER 1stReadFile DONE & 2ndReadFile
>> $0 ~ delete_list[bufpos + 1] { # IF CurrentLine ~
The ~ is awk's match operator.
Maybe think of it as:
IF RegexCompare(CurrentLine, delete_list[bufpos + 1]) THEN
>> buffer[++bufpos] = $0
>> if (bufpos >= delmax)
>> flush_buffer(blocks_seen++)
>> next
>> }
>> bufpos {
>> flush_buffer(0)
>> }
That's a bug. It's necessary to check whether $0 matches delete_list[1]
(and restart buffering if it does) after flushing the buffer.
>> { print }
>> END {
>> flush_buffer(0)
>> }
>> ----script ends on previous line
> The test conclusions, so far, are that:
> if chars "(", "]", "[" are in the DeleteFile: H,
> this gives problems.
They are regular expression metacharacters with special meaning to
awk's match operator.
Here's the fixed version of the script:
----script begins on next line
#! /usr/bin/awk -f
function flush_buffer(discard, n) {
if (!discard)
for (n = 1; n <= bufpos; n++)
print buffer[n]
bufpos = 0
}
function try_seq(s) {
if (s ~ delete_list[bufpos + 1]) {
buffer[++bufpos] = s
if (bufpos >= delmax)
flush_buffer(blocks_seen++)
return 1
}
return 0
}
NR == FNR {
delete_list[++delmax] = $0
next
}
try_seq($0) { next }
bufpos {
flush_buffer(0)
if (try_seq($0))
next
}
{ print }
END {
flush_buffer(0)
}
----script ends on previous line
The script works by loading the first file's contents into an
array. The array is a sequence of regular expressions.
The second file is scanned for sequences of lines in which each
line matches the corresponding entry in the array of regular
expressions.
When a complete sequence of matches is made the matched lines are
discarded if they are not the first occurrence of a match-sequence.
Input file 1 (FileDelete2) contains three lines:
a
b
[cz]
Input file 2 (FileIn2) contains 17 lines:
1 : nothing
2 : a MATCHES THE FIRST PATTERN IN THE SEQUENCE
3 : b MATCHES THE SECOND PATTERN IN THE SEQUENCE
4 : k
5 : a 2 SEQUENCE BEGINS
6 : b 2
7 : c 1 FULL SEQUENCE MATCHES (FIRST TIME: 5,6,7 PRINTED)
8 : a 3 SEQUENCE BEGINS
9 : a 4 SEQUENCE FAILS, LINE 8 PRINTED, NEW SEQUENCE BEGINS
10: b 3
11: c 2 FULL SEQUENCE MATCHES (NOT FIRST TIME: 9,10,11 OMITTED)
12: NO MATCH
13: c 3 OUT OF SEQUENCE, NO MATCH
14: a 5 FIRST IN SEQUENCE
15: b 4 SECOND IN SEQUENCE
16: z 1 THIRD IN SEQUENCE (14, 15, 16 DROPPED)
17: example SEQUENCE BEGINS, FAILS DUE TO END-OF-INPUT, 17 PRINTED
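Because every line of the delete file is used as a regular expression, a line such as "Name (required)" does not match itself: the parentheses form a group instead of standing for literal characters. If the block is meant as literal text, one option (an assumption about intent, not part of the script above) is to compare whole lines with == instead of ~. The sketch below shows the difference on a single line; compare_line is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: regex match (~) versus literal comparison (==) on the same text.
# The argument is used both as the pattern and as the line being tested.
compare_line() {
    awk -v pat="$1" -v line="$1" '
    BEGIN {
        print "regex:   " ((line ~ pat)  ? "match" : "no match")
        print "literal: " ((line == pat) ? "match" : "no match")
    }'
}
compare_line 'Name (required)'
# prints:
# regex:   no match
# literal: match
```

In the script this would amount to writing `s == delete_list[bufpos + 1]` in try_seq. Note that == requires the whole line to be identical, whereas ~ also accepts a match anywhere inside the line.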
It seems that this file is a woven noweb Literate Programming document. Is the noweb source code also available? I'm still interested in LP, and there are very few real LP examples on the net.
In article <3scsi9xcs3....@perseus.wenlock-data.co.uk>, dave.gma+news...@googlemail.com.invalid (Dave Gibson) wrote:
Someone contaminated my thread.
Let's move this to:
Newsgroups: comp.lang.awk,comp.os.linux.misc
Subject: awk: DeleteRepeatingTextBlocks
Let's try to add-value for the *nix community by revealing
methods which others can modify and use for their problems.
===========
> >> print buffer[n]
> >> bufpos = 0 # local var
> >> }
> bufpos is global, discard and n are only visible within that function.
-> man awk | grep bufpos == <empty>
So 'bufpos' is not a reserved-word [mentioned in man].
So how does awk's syntax make it global, whereas 'discard' and 'n' are local?
I see 'bufpos' further on in the code.
=================
> The script works by loading the first file's contents into an
> array. The array is a sequence of regular expressions.
"loading the first file's contents into an array."
is an action intended to help achieve a HIGHER goal.
It's better to state the higher goal FIRST.
It's called top-down-design.
The implementation details, which must be bottom-up,
are best not explained until the top-down-design is
known. Here's my STRUCTURED view:
SpeedUp knowledge absorption from http-fetched text
  Delete noise/distracting repeated garbage
    Identify garbage
      Must be done by human intelligence
      use a standard editor -- while reading/studying the text
    Automate the removal of further garbage-repeats
      Search the InFile for further matches of human-identified-trash
==> PS. I'm writing this WHILE I'm trying to decode your explanation.
The decomposition-chain is: Delete needs Match needs Regex.
You are going to compare the DeleteFile with the InFile.
To handle the regex requirement, you are building "an array of regular expressions".
Apparently to <match the array with InFile parts> ?
My test results for your new script are dumb, since I have no intermediate output traces.
See Subject: awk: DeleteRepeatingTextBlocks
["Followup-To:" header set to comp.os.linux.misc.]
On 2012-09-28, f...@informatik.uni-bremen.de <f...@informatik.uni-bremen.de> wrote:
> Let's try to add-value for the *nix community by revealing
> methods which others can modify and use for their problems.
I think the best way to add value for the *nix community is for you to
stop asking these poorly phrased questions, and for the rest of us to
stop answering them.
--keith
-- kkeller-use...@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information