
EdMorton's FileCleaner awk Testresults


no.to...@gmail.com

Oct 17, 2012, 11:49:37 AM
Did I fail to get these test results for:
Subject: Re: Re (6): Is awk suitable for this?
Date: Mon, 17 Sep 2012 14:13:49 GMT
back within a month?

Since this utility's Ver1 has proven successful, it's appropriate to
explain further why WE need it.

Most info from the net these days is fetched via http - regrettable IMO.
So there are aZillion webPage designers all following the same template.
Like lawyers who'll never use ten words when tenThousand will do, they
fill out each web-page that you fetch with repetitive bloat.

Consider the pattern from decades ago, where you needed to extract a 6 part
article from a monthly periodical, where the average article-part size was
2 and a half pages, for an average periodical size of 40 pages.
So too today, if you fetched your http-info by lynx/links/elinks, instead
of having the text wrapped in pictures, you'd notice that the valuable
text-info is wrapped in repetitive garbage. It's garbage because you
only need ONE copy.

Inevitably if you need some info from a site, you'll need to get several
'pages'. Which means that the redundant packaging will be the same for
each page fetched.

In order to BuildAbook of the particular info, I use a script(s) which does:
ForEachLineInFile ListOfURLs
AppendToFile BookFile:
TheUrl
TheFetchedText
AsectionSeparatorLine"<><><>"
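
The loop above might look roughly like this in shell. This is only a sketch, not the actual `g1277` script: the names `build_book` and `fetch_text` are made up here, and it assumes `lynx` is installed and that `lynx -dump -width=77` gives a suitable plain-text rendering.

```shell
# Hypothetical stand-in for the fetch step; assumes lynx is installed.
fetch_text() {
    lynx -dump -width=77 "$1" < /dev/null
}

# build_book LIST_OF_URLS BOOK_FILE
# For each URL in the list, append the URL, its fetched text and a
# separator line to the book file.
build_book() {
    list=$1
    book=$2
    # The "|| [ -n ... ]" part also handles a last line without an EOL.
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue          # skip empty lines in the list
        {
            printf '%s\n' "$url"
            fetch_text "$url"
            printf '%s\n' '<><><>'
        } >> "$book"
    done < "$list"
}
```

Called as `build_book medL "<medicalDir>/<BookTitle>"`, it would append to the book on every run, so the book grows each time the list is fetched again.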

So when I [pay to] go online, to get my email, a 2line script may run:
g1277 medL <medicalDir>/<BookTitle>
g1277 electroL <electro>/<elBookTitle>
-------
where `g1277` is the above described script, which needs 1, 2 args
and does FetchTheTextMax: 77Char LineLen.

So then after the email is d/l-ed, the 2 books, already in their
appropriate directory, have been started or extended.
But to clean-out the annoying repetitions, we need:
EdMorton's FileCleaner awk; which is coded in the above mentioned thread.

===============> here are some test results for vers1.
=> extract to R some <lines to be deleted>
-> awk -f tst.awk R /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/J14d > J14dT
-> cat /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/J14d | wc -l == 2961
-> cat J14dT | wc -l == 2931 Good! R is 2 lines; so 15 repeats were removed.
----------------
==> extract another 5-lines to R2
-> awk -f tst.awk R2 J14dT > J14dT2
-> cat J14dT2 | wc -l == 2856 ==> 5 * 15 == Good.
----------------
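
The line-count checks in the two runs above can be scripted instead of done by eye. A small sketch (the function name `copies_removed` is invented for this example): it compares `wc -l` before and after and divides by the line count of the delete-file.

```shell
# copies_removed BEFORE AFTER BLOCK
# Prints how many lines disappeared between BEFORE and AFTER, and how many
# whole copies of BLOCK (by line count) that corresponds to.
copies_removed() {
    before=$(( $(wc -l < "$1") ))
    after=$(( $(wc -l < "$2") ))
    block=$(( $(wc -l < "$3") ))
    if [ "$block" -eq 0 ]; then
        echo "block file has no complete lines" >&2
        return 1
    fi
    removed=$(( before - after ))
    echo "$removed lines removed: $(( removed / block )) copies, remainder $(( removed % block ))"
}
```

For the first run that is 2961 - 2931 = 30 lines for a 2-line R, i.e. 15 copies; for the second, 2931 - 2856 = 75 lines for a 5-line R2, again 15 copies. A non-zero remainder would suggest something other than whole blocks was removed.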
==> extract a 1-liner to R3
-> awk -f tst.awk R3 J14dT2 > J14dT3
-> cat J14dT3 | wc -l == 2856 = none removed ?
-> cat R3 ==
[67]View profile
-> ls -l J14dT* ==
-rw-r--r-- 1 root root 133614 2012-10-16 21:32 J14dT
-rw-r--r-- 1 root root 129813 2012-10-16 21:39 J14dT2
-rw-r--r-- 1 root root 129813 2012-10-16 21:48 J14dT3
=> perhaps there's a problem with one-liners, or I forgot EOL?
----------------
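
Whether a delete-file is missing its final EOL can be checked before running the cleaner. A minimal sketch (the name `ends_with_newline` is made up here); it relies on command substitution stripping a trailing newline, so the result is empty exactly when the last byte is a newline.

```shell
# ends_with_newline FILE
# Succeeds if FILE is non-empty and its last byte is a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Example use with the block file from the log above:
#   ends_with_newline R3 || printf '\n' >> R3
```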
-> awk -f tst.awk S /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/D12 > D12a
-> cat /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/D12 | wc -l == 9019
-> cat D12a | wc -l == 8863 = good.
----------------
==> let's try deleting one-and-half lines ?
-> awk -f tst.awk S1.5 D12a > D12b
-> cat D12b | wc -l == 8863
-> cat S1.5 ==
[192]View profile
More options
=> the second-line had no EOL. Need more investigation.
----------------
=> manually viewing: there are many "View profile"
eg:" [101]View profile"
? Does this mean that <DeleteFile> must be 1 < LineCount
That's not a big-problem.
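
Another possible explanation, which is only a guess and not checked against tst.awk: R3 holds "[67]View profile" while other copies read "[101]View profile", so if the number in brackets differs in every copy, the exact line occurs once and there is nothing to remove. The sketch below (both function names are invented) counts exact copies of a one-line delete-file and, for comparison, all numbered "View profile" lines.

```shell
# count_exact_line BLOCKFILE INFILE
# Prints how many lines of INFILE are exactly equal to the first line
# of BLOCKFILE (fixed string, whole line).
count_exact_line() {
    line=$(head -n 1 "$1")
    grep -c -F -x -e "$line" "$2"
}

# count_view_profile_lines INFILE
# Prints how many lines look like "[N]View profile" for any number N,
# allowing leading spaces.
count_view_profile_lines() {
    grep -c -E '^ *\[[0-9]+\]View profile$' "$1"
}
```

If `count_exact_line R3 J14dT2` prints 1 while `count_view_profile_lines J14dT2` prints a much larger number, the one-liner is not really a repeated block and the unchanged line count is the expected result.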

An interesting minor error is that the tail of the InFile
is changed from eg:---------------
175. http://www.google.com/preferences?hl=en
<><><><><>
---------------- to:

175. http://www.google.com/preferences?hl=en
<><><><><>

<><><><>
----------------, not repeatedly, but only for the
first cleaning-cycle of the file.
So the <EOL>"<><><><><>" is appended once only.

More importantly, I need to be able to roam the file tree, and apply
DeleteRepeatBloks aka `DR` after extracting a candidate text-block.
I use `mc`. So far, I'm having problems re. the <pathAccess> from the
DR-script [which is in PATH], to DeleteFile. So far I've just put the
DeleteFile in the same dir as the InFile. But this tends to leave extra
scattered garbage. And since it's a one-at-time-user-PC, it may be
better to extract the DeleteFile to a fixed/default location: /tmp/R.
And then there's the refinement of how-best to handle the new-identity of
the text-file as it becomes cleaned through N-cycles.
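
A wrapper along these lines might cover both points, the fixed delete-file location and the changing identity of the text-file. It is a sketch under assumptions: `dr`, `DR_AWK` and `DR_BLOCK` are names made up here, the path to tst.awk is a placeholder, and `mktemp` is assumed to be available.

```shell
# Illustrative defaults: where tst.awk lives is an assumption, /tmp/R is
# the fixed delete-file location suggested above.
DR_AWK=${DR_AWK:-$HOME/bin/tst.awk}
DR_BLOCK=${DR_BLOCK:-/tmp/R}

# dr INFILE
# Runs the cleaner on INFILE with the fixed delete-file, keeps the previous
# version as INFILE.bak, and replaces INFILE only if awk succeeded.
dr() {
    infile=$1
    tmp=$(mktemp "${infile}.XXXXXX") || return 1
    if awk -f "$DR_AWK" "$DR_BLOCK" "$infile" > "$tmp"; then
        cp -p "$infile" "$infile.bak" &&
        mv "$tmp" "$infile" &&
        echo "$infile: $(( $(wc -l < "$infile.bak") )) -> $(( $(wc -l < "$infile") )) lines"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

The file keeps its name through every cleaning cycle and only one backup is left behind, so nothing extra is scattered around the tree. Whether a single `.bak` is enough history is a matter of taste.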

NB. The book gets cleaner, the more it's read/used.

== TIA.

no.to...@gmail.com

Oct 25, 2012, 2:31:35 PM
In article <k5mk2g$re5$1...@dont-email.me>, no.to...@gmail.com wrote:

It's perhaps getting more inappropriate to do bigger collaboration projects
in these days of superficial twitter-kiddies' short attention span?
--------------
A monster sample which tests this utility is:
http://www.commandlinefu.com/commands/

After I'd fetched 8 pages of it by:
<links append the 8 URLs' contents> to File:A

-> cat A | wc -l
showed that I had 30727 lines.

Interestingly, www.commandlinefu.com/commands/* tells
>Delete that bloated snippets file you've been using and share your
> personal repository with the world.

Our utility is intended to clean-out bloat, which commandlinefu has created.
Admittedly, their lines are small, so for byte-count they are not the worst.

Obviously, the utility simulates an editor, which can Search/Replace
RegEx. But the buffer must be able to handle 50-lines or more.
Probably emacs can do it, but I can't find how to.

The hope is to get to a stage where the redundantly-repeated text-block
can just be marked, and one-key will delete all further copies of it.
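
For readers without the earlier thread, the idea of "delete all further copies of a marked block" can be sketched as below. This is NOT the tst.awk from that thread, just an illustration: the name `drop_repeats` is invented, matching is exact and whole-line, and both files are held in memory, which should be fine for some tens of thousands of lines.

```shell
# drop_repeats BLOCKFILE INFILE
# Hypothetical helper, not the tst.awk from the thread.
# Prints INFILE with every copy of the block after the first one removed.
drop_repeats() {
    awk '
        # Lines of the first file form the block to delete.
        FILENAME == ARGV[1] { block = block "\n" $0; n++; next }
        # Lines of the second file form the text to clean.
        { text = text "\n" $0 }
        END {
            # Every line now starts with "\n"; add a closing "\n" so that
            # a match always covers whole lines, never part of a line.
            text = text "\n"
            if (n == 0) { printf "%s", substr(text, 2); exit }
            block = block "\n"
            blen = length(block)
            out = ""
            while ((pos = index(text, block)) > 0) {
                out = out substr(text, 1, pos - 1)
                if (!seen++)
                    out = out substr(text, pos, blen - 1)   # keep first copy
                # Continue from the newline that ended this match.
                text = substr(text, pos + blen - 1)
            }
            out = out text
            printf "%s", substr(out, 2)    # drop the leading sentinel "\n"
        }
    ' "$1" "$2"
}
```

Usage would be `drop_repeats Da A > B`. Because the block is a plain string and not a RegEx, a 50-line block is no harder than a 1-line block.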

So far I'm cycling the text: A->B->A->B ..
for each block which is to be deleted, by copying the redundant block
to Da, Db, Da ..., where `EdMorton's FileCleaner awk`
uses the appropriate delete-file, Da or Db, to remove all but the first
copy of the block in A or B respectively. After each hopeful DeleteNextCycle I check
the new size. In this tedious testing stage, the first problem found
was non-ASCII chars. So the input file must first be filtered to remove them.
Which is OK, except who can tell why some texts have BAD quote-chars?
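
The BAD quote-chars are quite likely typographic ("curly") quotes, which take several bytes each in UTF-8. A possible pre-filter, assuming the fetched text is UTF-8 (the name `to_ascii` is made up here): it maps the four common curly quotes to plain ones and then drops every remaining byte outside printable ASCII, tab, LF and CR.

```shell
# to_ascii: filter stdin to stdout.
# Assumes UTF-8 input; anything non-ASCII that is not mapped is deleted.
to_ascii() {
    # UTF-8 byte sequences of the left/right single and double curly quotes.
    lsq=$(printf '\342\200\230'); rsq=$(printf '\342\200\231')
    ldq=$(printf '\342\200\234'); rdq=$(printf '\342\200\235')
    LC_ALL=C sed -e "s/$lsq/'/g" -e "s/$rsq/'/g" \
                 -e "s/$ldq/\"/g" -e "s/$rdq/\"/g" |
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}
```

For example `to_ascii < A > A.ascii` before the first cleaning cycle. Accented letters are simply lost, which may or may not be acceptable for a given book.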

An examination of the docos of `diff` is relevant:---
-E --ignore-tab-expansion
Ignore changes due to tab expansion.

-b --ignore-space-change
Ignore changes in the amount of white space.

-w --ignore-all-space
Ignore all white space.

-B --ignore-blank-lines
Ignore changes whose lines are all blank.

AFAIK, we've got the first 3, but we also need <ignore-blank-lines>.

I looked at the code, but I can't see where to patch it in.
Besides, the testing is very tedious.

Please Ed, some time, post a 'patch' to ignore-blank-lines,
and keep the code open, while I do further use/tests.
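
Until there is such a patch, a crude workaround may be to strip blank lines from both the delete-file and the text before cleaning. This is not a change to tst.awk and it is not the same as `diff -B`: the blank lines are gone from the output too. The function name `strip_blank_lines` is invented for this sketch.

```shell
# strip_blank_lines FILE...
# Prints the files with empty and whitespace-only lines removed.
strip_blank_lines() {
    awk 'NF' "$@"
}

# Example use (file names as in the A->B cycling described above):
#   strip_blank_lines Da > Da.nb
#   strip_blank_lines A  > A.nb
#   awk -f tst.awk Da.nb A.nb > B
```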


== TIA.


Ed Morton

Oct 25, 2012, 6:23:13 PM
On 10/25/2012 1:31 PM, no.to...@gmail.com wrote:
<snip>
> Please Ed, some time, post a 'patch' to ignore-blank-lines,
> and keep the code open, while I do further use/tests.

I'm sorry, there's just far too much verbiage in your posts for me to take the
time to read and understand them (I already have a job!). Given the lack of
response from anyone else so far, I suspect others feel the same too so I
recommend you just:

1) Tell us briefly what you need a script to do (not why it needs to do it or
the history that led up to it).
2) Post some small, representative sample input.
3) Post the desired output you'd get from running your desired script on that
input file.
4) Post any specific question(s) you have.

Regards,

Ed.