EdMorton's FileCleaner awk Testresults
no.top.p...@gmail.com  
Newsgroups: comp.lang.awk, comp.os.linux.misc
From: no.top.p...@gmail.com
Date: Wed, 17 Oct 2012 15:49:37 +0000 (UTC)
Local: Wed, Oct 17 2012 11:49 am
Subject: EdMorton's FileCleaner awk Testresults
     Did I fail to get these test results for:
Subject: Re: Re (6): Is awk suitable for this?
Date: Mon, 17 Sep 2012 14:13:49 GMT
   back within a month?

Since this utility's Ver1 has proven successful, it's appropriate to
explain further why WE need it.

Most info from the net these days is fetched via http, regrettable IMO.
So there are aZillion webPage designers all following the same template.
Like lawyers who'll never use ten words when tenThousand will do, they
fill out each web-page that you fetch with repetitive bloat.

Consider the pattern from decades ago, where you needed to extract a 6 part
article from a monthly periodical, where the average article-part size was
2 & a half pages, for an average periodical size of 40 pages.
So too today: if you fetched your http-info by lynx/links/elinks, instead
of having the text wrapped in pictures, you'd notice that the valuable
text-info is wrapped in repetitive garbage. It's garbage because you
only need ONE copy.

Inevitably if you need some info from a site, you'll need to get several
'pages'. Which means that the redundant packaging will be the same for
each page fetched.

In order to BuildAbook of the particular info, I use a script(s) which does:
  ForEachLineInFile  ListOfURLs
    AppendToFile BookFile:
      TheUrl
      TheFetchedText
      AsectionSeparatorLine"<><><>"

So when I [pay to] go online, to get my email, a 2line script may run:
  g1277 medL <medicalDir>/<BookTitle>
  g1277  electroL   <electro>/<elBookTitle>
-------
 where `g1277` is the above described script, which takes 1 or 2 args
 and fetches the text with a max line length of 77 chars.
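
The loop described above could be sketched roughly as follows. This is not
the real `g1277`, which is not shown in this thread; the function names
`fetch_text` and `build_book` are made up for illustration, and the use of
`lynx -dump -width=77` as the fetcher is an assumption (links or elinks
would need their own options).

```shell
#!/bin/sh
# Hypothetical stand-in for g1277; names and fetcher are assumptions.

# Fetch one URL as plain text, wrapped at 77 columns (assumes lynx is installed).
fetch_text() {
    lynx -dump -width=77 "$1"
}

# build_book LIST BOOK: for each URL in LIST, append the URL, the fetched
# text and a section separator line to BOOK.
build_book() {
    list=$1
    book=$2
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue            # skip empty lines in the list
        {
            printf '%s\n' "$url"
            fetch_text "$url" < /dev/null    # keep the fetcher off the loop's stdin
            printf '%s\n' '<><><>'
        } >> "$book"
    done < "$list"
}

# Usage (paths are placeholders):
#   build_book medL "$HOME/medical/BookTitle"
```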

So then after the email is d/l-ed, the 2 books, already in their
appropriate directories, have been created or extended.
But to clean-out the annoying repetitions, we need:
EdMorton's FileCleaner awk; which is coded in the above mentioned thread.
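
For readers without that thread at hand, the idea can be sketched like this.
It is NOT Ed Morton's tst.awk, only a minimal illustration that assumes
exact, line-for-line matching of the block; the function name
`remove_repeats` is made up. The argument order follows the test runs
below: delete-file first, input file second.

```shell
#!/bin/sh
# remove_repeats DELFILE INFILE
# Print INFILE, keeping the first copy of the block held in DELFILE and
# dropping every later exact copy. Sketch only, not the tst.awk from the thread.
remove_repeats() {
    awk '
    FILENAME == ARGV[1] { blk[++n] = $0; next }   # first file: the block to delete
    { line[++m] = $0 }                            # second file: buffer every line
    END {
        i = 1
        while (i <= m) {
            # does the whole block start at line i?
            hit = (n > 0 && i + n - 1 <= m)
            for (j = 1; hit && j <= n; j++)
                if (line[i + j - 1] != blk[j]) hit = 0
            if (hit) {
                if (!seen++)                      # first copy: keep it
                    for (j = 0; j < n; j++) print line[i + j]
                i += n                            # step over the block either way
            } else {
                print line[i++]
            }
        }
    }' "$1" "$2"
}

# Usage: remove_repeats R J14d > J14dT
```

Because the comparison is exact, any difference in whitespace, link numbers
or line endings between the block and a copy means that copy is kept.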

===============> here are some test results for vers1.
=> extract to R some <lines to be deleted>
-> awk -f tst.awk R /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/J14d  > J14dT
-> cat /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/J14d | wc -l  == 2961
-> cat  J14dT | wc -l == 2931 Good! 2961 - 2931 = 30 lines gone; R is 2 lines,
   so 15 repeats were removed.
----------------
==> extract another 5-lines to R2
-> awk -f tst.awk R2   J14dT >  J14dT2
-> cat  J14dT2 | wc -l == 2856 ==> 2931 - 2856 = 75 == 5 * 15 == Good.
----------------
==> extract a 1-liner to R3
-> awk -f tst.awk R3   J14dT2 >  J14dT3
-> cat  J14dT3 | wc -l == 2856  = none removed ?
-> cat R3 ==
   [67]View profile
-> ls -l J14dT* ==
-rw-r--r-- 1 root root 133614 2012-10-16 21:32 J14dT
-rw-r--r-- 1 root root 129813 2012-10-16 21:39 J14dT2
-rw-r--r-- 1 root root 129813 2012-10-16 21:48 J14dT3
=> perhaps there's a problem with one-liners, or I forgot EOL?
----------------
-> awk -f tst.awk  S /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/D12 > D12a
-> cat /mnt/hdc11/Inet/USENET/GroupsIndx/OwnWebPage/D12 | wc -l == 9019
-> cat D12a | wc -l == 8863 = good.
----------------
==> let's try deleting one-and-half lines ?
-> awk -f tst.awk  S1.5  D12a >  D12b
-> cat D12b | wc -l == 8863
-> cat S1.5 ==
   [192]View profile
    More options
=> the second-line had no EOL. Need more investigation.
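
One way to settle the EOL question before a run is to test the last byte of
the delete-file. A small sketch; the function name `ends_with_newline` is
made up for illustration:

```shell
#!/bin/sh
# ends_with_newline FILE: succeed if FILE is non-empty and its last byte is
# a newline. Command substitution strips a trailing newline, so an empty
# result from `tail -c 1` means the last byte was a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Usage: report on the delete-files before using them.
#   for f in R3 S1.5; do
#       ends_with_newline "$f" || echo "$f: no newline at end of file"
#   done
```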
----------------
=> manually viewing: there are many "View profile" lines,
eg:"    [101]View profile"
? Does this mean that <DeleteFile> must have more than 1 line?
That's not a big problem.
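
Another possible explanation, offered as a guess since tst.awk is not shown
here: the samples above carry different link numbers ([67], [101], [192]),
so if the matching is exact, a delete-file holding "[67]View profile" can
only match the one line numbered 67. A sketch that strips the bracketed
numbers so such lines become identical; `strip_link_numbers` is a made-up
name, and stripping also loses the link references in the text:

```shell
#!/bin/sh
# strip_link_numbers [FILE...]: remove bracketed link numbers such as [67]
# from every line. The regex is a literal "[", one or more digits, then "]".
strip_link_numbers() {
    sed 's/\[[0-9][0-9]*]//g' "$@"
}

# Usage: normalise both files the same way before cleaning.
#   strip_link_numbers R3     > R3n
#   strip_link_numbers J14dT2 > J14dT2n
```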

An interesting  minor error is that the tail of the InFile
 is changed from eg:---------------
 175. http://www.google.com/preferences?hl=en
<><><><><>
----------------  to:

 175. http://www.google.com/preferences?hl=en
<><><><><>

<><><><>
----------------, not repeatedly, but only for the
first cleaning-cycle of the file.
So the <EOL>"<><><><><>" is appended once only.

More importantly, I need to be able to roam the file tree, and apply
DeleteRepeatBloks aka `DR` after extracting a candidate text-block.
I use `mc`.  So far, I'm having problems re. the <pathAccess> from the
DR-script [which is in PATH], to DeleteFile. So far I've just put the
DeleteFile in the same dir as the InFile. But this tends to leave extra
scattered garbage. And since it's a one-at-time-user-PC, it may be
better to extract the DeleteFile to a fixed/default location: /tmp/R.
And then there's the refinement of how-best to handle the new-identity of
the text-file as it becomes cleaned through N-cycles.
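
A possible shape for such a `DR` wrapper, as a sketch only. The variable
`TST_AWK` and its default path are assumptions (point it at wherever
tst.awk really lives); /tmp/R is the default delete-file suggested above.
The file keeps its name through every cycle and the previous version is
saved as a .bak copy.

```shell
#!/bin/sh
# DR INFILE [DELFILE]: clean INFILE in place using tst.awk.
# DELFILE defaults to /tmp/R; TST_AWK (hypothetical) holds the absolute
# path to tst.awk so the wrapper works from any directory.
DR() {
    dr_in=$1
    dr_del=${2:-/tmp/R}
    dr_tmp=$(mktemp) || return 1
    if awk -f "${TST_AWK:-$HOME/bin/tst.awk}" "$dr_del" "$dr_in" > "$dr_tmp"; then
        # keep the previous version, then overwrite the file under its old name
        cp -p "$dr_in" "$dr_in.bak" && cat "$dr_tmp" > "$dr_in"
        dr_status=$?
    else
        dr_status=1
    fi
    rm -f "$dr_tmp"
    return "$dr_status"
}

# Usage: after saving the marked block from mc to /tmp/R
#   DR /path/to/BookFile
```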

NB. The book gets cleaner, the more it's read/used.

== TIA.


 
no.top.p...@gmail.com  
Newsgroups: comp.lang.awk, comp.os.linux.misc
From: no.top.p...@gmail.com
Date: Thu, 25 Oct 2012 18:31:35 +0000 (UTC)
Local: Thurs, Oct 25 2012 2:31 pm
Subject: Re: EdMorton's FileCleaner awk Testresults

In article <k5mk2g$re...@dont-email.me>, no.top.p...@gmail.com wrote:

It's perhaps getting more inappropriate to do bigger collaboration projects
in these days of superficial twitter-kiddies' short attention span?
--------------
A monster sample which tests this utility is:
http://www.commandlinefu.com/commands/

After I'd fetched 8 pages of it by:
<links appends the contents of the 8 URLs> to File:A

-> cat A | wc -l
showed that I had 30727 lines.

Interestingly, www.commandlinefu.com/commands/*  says:

>Delete that bloated snippets file you've been using and share your
>  personal repository with the world.

Our utility is intended to clean-out bloat, which commandlinefu has created.
Admittedly, their lines are small, so for byte-count they are not the worst.

Obviously, the utility simulates an editor, which can Search/Replace
RegEx. But the buffer must be able to handle 50-lines or more.
Probably emacs can do it, but I can't find how to.

The hope is to get to a stage where the redundantly-repeated text-block
can just be marked, and one-key will delete all further copies of it.

So far I'm cycling the text: A->B->A->B ..
for each block which is to be deleted, by copying the redundant block
to Da, Db, Da ..., where `EdMorton's FileCleaner awk`
uses the appropriate delete-file, Da or Db, to remove all but the first
of the repeats in A, B respectively. After each hopeful DeleteNextCycle I
check the new size.  In this tedious testing stage, the first problem found
was non-ASCII chars. So the input file must first have them filtered out.
Which is OK, except who can tell why some texts have BAD quote-chars?
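
The filtering step might look like the sketch below; `strip_non_ascii` is a
made-up name. It deletes every byte that is not a tab, newline, carriage
return or printable ASCII character. Non-ASCII quote-chars are dropped, not
converted to plain quotes, which is a simplification.

```shell
#!/bin/sh
# strip_non_ascii: filter stdin to stdout, keeping only tab (\11),
# newline (\12), carriage return (\15) and printable ASCII (\40-\176).
# LC_ALL=C makes tr work on single bytes.
strip_non_ascii() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Usage: strip_non_ascii < A > A.ascii
```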

An examination of the docos of `diff` is relevant:---
       -E  --ignore-tab-expansion
              Ignore changes due to tab expansion.

       -b  --ignore-space-change
              Ignore changes in the amount of white space.

       -w  --ignore-all-space
              Ignore all white space.

       -B  --ignore-blank-lines
              Ignore changes whose lines are all blank.

AFAIK, we've got the first 3, but we also need <ignore-blank-lines>.
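
Until there is a real patch, a crude workaround is to drop blank lines from
both the delete-file and the input before cleaning. This is not a change to
tst.awk, and unlike `diff -B` it removes the blank lines from the output as
well. `drop_blank_lines` is a made-up name.

```shell
#!/bin/sh
# drop_blank_lines [FILE...]: print only lines that contain at least one
# field, i.e. skip lines that are empty or hold only spaces and tabs.
drop_blank_lines() {
    awk 'NF' "$@"
}

# Usage:
#   drop_blank_lines Da > Da.nb
#   drop_blank_lines A  > A.nb
#   awk -f tst.awk Da.nb A.nb > B
```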

I looked at the code, but I can't see where to patch it in.
Besides, the testing is very tedious.

Please Ed, some time, post a 'patch' to ignore-blank-lines,
and keep the code open, while I do further use/tests.

== TIA.


 
Ed Morton  
Newsgroups: comp.lang.awk, comp.os.linux.misc
From: Ed Morton <mortons...@gmail.com>
Date: Thu, 25 Oct 2012 17:23:13 -0500
Local: Thurs, Oct 25 2012 6:23 pm
Subject: Re: EdMorton's FileCleaner awk Testresults
On 10/25/2012 1:31 PM, no.top.p...@gmail.com wrote:
<snip>

> Please Ed, some time, post a 'patch' to ignore-blank-lines,
> and keep the code open, while I do further use/tests.

I'm sorry, there's just far too much verbiage in your posts for me to take the
time to read and understand them (I already have a job!). Given the lack of
response from anyone else so far, I suspect others feel the same too so I
recommend you just:

1) Tell us briefly what you need a script to do (not why it needs to do it or
the history that led up to it).
2) Post some small, representative sample input.
3) Post the desired output you'd get from running your desired script on that
input file.
4) Post any specific question(s) you have.

Regards,

    Ed.


 