Did I fail to get these test results for:
Subject: Re: Re (6): Is awk suitable for this?
Date: Mon, 17 Sep 2012 14:13:49 GMT
back within a month?
Since this utility's Ver1 has proven successful, it's appropriate to
explain further why WE need it.
Most info from the net these days is fetched via http, which is regrettable IMO.
So there are aZillion webPage designers, all following the same template.
Like lawyers who'll never use ten words when tenThousand will do, they
fill out each web-page that you fetch with repetitive bloat.
Consider the pattern from decades ago, where you needed to extract a 6-part
article from a monthly periodical, where the average article-part size was
2.5 pages, in an average periodical size of 40 pages.
So too today: if you fetched your http-info by lynx/links/elinks, instead
of having the text wrapped in pictures, you'd notice that the valuable
text-info is wrapped in repetitive garbage. It's garbage because you
only need ONE copy.
Inevitably, if you need some info from a site, you'll need to fetch several
'pages', which means that the redundant packaging will be the same for
each page fetched.
In order to BuildAbook of the particular info, I use a script which does this:
when I [pay to] go online to get my email, a 2-line script may run:
g1277 medL <medicalDir>/<BookTitle>
g1277 electroL <electro>/<elBookTitle>
where `g1277` is the above-described script, which takes 2 args
and does FetchTheTextMax: 77Char LineLen.
So then, after the email is d/l-ed, the 2 books, already in their
appropriate directories, are initiated or expanded.
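The internals of `g1277` aren't shown in the thread, so the following is
only a hypothetical sketch of what such a fetcher could look like: the use
of `lynx -dump`, the URL-list file format, and the function name are all
my assumptions, not the real script.

```shell
# HYPOTHETICAL sketch of a g1277-style fetcher (the real script is not
# shown in the thread).  Arg 1 is assumed to be a file listing URLs for
# one topic (e.g. medL); arg 2 is the book file to append to.
fetch_book() {
    urllist=$1
    book=$2
    while IFS= read -r url; do
        # lynx -dump renders a page as plain text; -nolist drops the
        # link-number footer; -width=77 matches "77Char LineLen".
        lynx -dump -nolist -width=77 "$url" >> "$book"
    done < "$urllist"
}
```

Run once per topic, it appends each fetched page's text to the growing book.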
But to clean out the annoying repetitions, we need
EdMorton's FileCleaner awk, which is coded in the above-mentioned thread.
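For readers who don't have that thread handy, the core idea (exact-match
deletion of a multi-line block given in a separate DeleteFile) can be
sketched in portable awk like this. This is a minimal slurp-and-splice
sketch, NOT Ed Morton's actual code; note it matches raw substrings, so a
real tool should also anchor blocks on line boundaries.

```shell
# Tiny demo: /tmp/R holds the 2-line block to delete, /tmp/in.txt the input.
printf 'A\nB\n' > /tmp/R
printf 'x\nA\nB\ny\nA\nB\nz\n' > /tmp/in.txt

# Sketch of tst.awk-style usage: awk -f dr.awk DeleteFile InFile > Out
awk '
    NR == FNR { blk = blk $0 "\n"; next }   # file 1: slurp the delete-block
              { txt = txt $0 "\n" }         # file 2: slurp the input
    END {
        while ((i = index(txt, blk)) > 0)   # splice out each occurrence
            txt = substr(txt, 1, i - 1) substr(txt, i + length(blk))
        printf "%s", txt
    }' /tmp/R /tmp/in.txt > /tmp/out.txt

wc -l < /tmp/out.txt   # 7 lines in, 2 occurrences of a 2-line block removed
```

One detail relevant to the EOL question below: this sketch re-joins every
line with "\n", so it silently normalizes a missing final newline; whether
a missing EOL breaks the match depends on how the real script reads the
files.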
===============> here are some test results for Ver1.
=> extract to R some <lines to be deleted>
-> awk -f tst.awk R /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/J14d > J14dT
-> cat /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/J14d | wc -l == 2961
-> cat J14dT | wc -l == 2931 Good! 30 lines removed; R is 2 lines, so 15 repeats were removed.
==> extract another 5-lines to R2
-> awk -f tst.awk R2 J14dT > J14dT2
-> cat J14dT2 | wc -l == 2856 ==> 75 lines removed == 5 * 15. Good.
==> extract a 1-liner to R3
-> awk -f tst.awk R3 J14dT2 > J14dT3
-> cat J14dT3 | wc -l == 2856 = none removed ?
-> cat R3 ==
-> ls -l J14dT* ==
-rw-r--r-- 1 root root 133614 2012-10-16 21:32 J14dT
-rw-r--r-- 1 root root 129813 2012-10-16 21:39 J14dT2
-rw-r--r-- 1 root root 129813 2012-10-16 21:48 J14dT3
=> perhaps there's a problem with one-liners, or I forgot an EOL?
-> awk -f tst.awk S /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/D12 > D12a
-> cat /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/D12 | wc -l == 9019
-> cat D12a | wc -l == 8863 = good.
==> let's try deleting one-and-a-half lines?
-> awk -f tst.awk S1.5 D12a > D12b
-> cat D12b | wc -l == 8863
-> cat S1.5 ==
=> the second line had no EOL. Needs more investigation.
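A cheap check of the EOL suspicion (hypothetical file names; this only
demonstrates the byte-level difference, it doesn't re-run tst.awk):

```shell
# Two delete-files whose ONLY difference is the final newline.
printf 'View profile'   > /tmp/R3a   # no trailing EOL
printf 'View profile\n' > /tmp/R3b   # with trailing EOL

wc -l < /tmp/R3a   # prints 0 -- wc -l counts newlines, not lines of text
wc -l < /tmp/R3b   # prints 1
cmp -s /tmp/R3a /tmp/R3b && echo same || echo differ   # prints: differ
```

So a one-line DeleteFile lacking its EOL is both undercounted by `wc -l`
and, to any byte-for-byte matcher, a different block.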
=> manually viewing: there are many "View profile" lines,
eg: " View profile"
? Does this mean that <DeleteFile> must have LineCount > 1?
That's not a big problem.
An interesting minor error is that the tail of the InFile
is changed, eg from ---------------
to ----------------, not repeatedly, but only for the
first cleaning-cycle of the file.
So the <EOL>"<><><><><>" is appended once only.
More importantly, I need to be able to roam the file tree, and apply
DeleteRepeatBloks aka `DR` after extracting a candidate text-block.
I use `mc`. So far, I'm having problems with the <pathAccess> from the
DR-script [which is in PATH] to the DeleteFile. Until now I've just put the
DeleteFile in the same dir as the InFile, but this tends to leave extra
scattered garbage. And since it's a one-user-at-a-time PC, it may be
better to extract the DeleteFile to a fixed/default location: /tmp/R.
And then there's the refinement of how best to handle the new identity of
the text-file as it becomes cleaner through N cycles.
NB. The book gets cleaner the more it's read/used.
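Under the fixed /tmp/R convention, a DR wrapper could clean a book in
place, so the file keeps one name across N cleaning cycles. Again a
hypothetical sketch (function name, in-place strategy, and the inner awk
are my assumptions, not the script from the thread):

```shell
# Hypothetical DR: delete every occurrence of the /tmp/R block from the
# named book, replacing it in place so its identity survives each cycle.
# Caveat: like the sketch above, this matches raw substrings, not
# line-anchored blocks.
DR() {
    book=$1
    tmp=$(mktemp) || return 1
    awk '
        NR == FNR { blk = blk $0 "\n"; next }
                  { txt = txt $0 "\n" }
        END {
            while ((i = index(txt, blk)) > 0)
                txt = substr(txt, 1, i - 1) substr(txt, i + length(blk))
            printf "%s", txt
        }' /tmp/R "$book" > "$tmp" && mv "$tmp" "$book"
}
```

Invoked from an `mc` user-menu entry or the shell as `DR <BookTitle>`, the
book shrinks in place and no scattered DeleteFiles are left beside it.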