Re (3): tricky text manipulation

no.to...@gmail.com

Oct 31, 2012, 5:21:46 AM

In article <>, nobody wrote:
> In article <20120923...@kylheku.com>, Kaz Kylheku <k...@kylheku.com> wrote:

Perhaps we can both benefit by considering Ed Morton's results?

> Without these reference input/output pairs, it is impossible to write,
> debug, test, and refine a piece of software.
>
> If you keep the exact input and output pairs to yourself (like you did
> throughout the entire large thread in the other newsgroups where you
> posted this) it will just be another big waste of time.

Yes, that makes sense in the context of what you write here:

> I designed an entire text-scraping programming language
> which does complex multi-line
> pattern matching over entire documents. Faced with your
> text-scraping task, I would just use that language. That still requires
> programming: looking at the data, considering the output, forming a
> text-extraction strategy, and expressing it in the language.

That seems very interesting, and much more complex than
the tool that I'm building for myself, to cater for my
<not driven by the WinTel herd> Inet methods.
Yours almost seems to go into AI?

You've no doubt been following my dialog with Ed Morton,
and I've just found an extra requirement that I'd overlooked.
As you well know, for many tasks we can't know the full spec
beforehand -- or ever -- because we don't know <what we
don't know>. EM's current version doesn't <handle>
<alphas> in square-brackets, apparently because of my big
emphasis on its need to <handle> <digits> in square-brackets.

It's interesting to note how we approach problems, biased
by our recent experiences: what led me into this 'algorithm':
WHILE SequentiallyReadingAfile
CreateAsub-algorithmTo
(EditForwardText LikeHereTextWasEdited)

This came up while <cleaning up> OCRs.
So, if you find that the first OCRed "first" appears as "xqrst",
it's likely that all further "first", and even all words with "fi",
are similarly wrongly OCRed.
So, you could 'automate' the action-sequence [perhaps
initially just writing it down, for reading instead of REthinking].
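
A minimal Python sketch of that idea, under assumptions: the names
learn_fix and apply_forward are made up for illustration, and the
rule is derived simply by trimming the characters that the wrong and
right spellings share at both ends. A blind replace like this could
also damage legitimate words that happen to contain the bad
sequence, so it is a starting point only.

```python
def learn_fix(wrong, right):
    """Trim the shared prefix and suffix; return the (bad, good) parts that differ."""
    limit = min(len(wrong), len(right))
    start = 0
    while start < limit and wrong[start] == right[start]:
        start += 1
    end = 0
    while end < limit - start and wrong[-1 - end] == right[-1 - end]:
        end += 1
    return wrong[start:len(wrong) - end], right[start:len(right) - end]


def apply_forward(text, wrong, right):
    """Apply the fix learned from one hand-made correction to the rest of the text."""
    bad, good = learn_fix(wrong, right)
    if not bad:
        # Pure insertion: nothing to search for, so leave the text alone.
        return text
    return text.replace(bad, good)


if __name__ == "__main__":
    rest = "the xqrst xqle was xqne"
    print(apply_forward(rest, "xqrst", "first"))
```

Correcting "xqrst" to "first" once yields the rule "xq" -> "fi",
which is then applied to everything that follows.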

Are you prepared to discuss the algorithm of your system?

Have I adequately explained that, for my requirement, you CAN'T
extract/scrape the good-stuff out, because you don't KNOW
the good-stuff; else you wouldn't be fetching what you
already know?

I've been using EM's awk-code on text-files of > 1 MB,
but it's still very tedious, because it's not yet integrated into
an editor. So I need to repeatedly loop these 4 steps:

* read down InFile and extract next-garbage to DeleteFile
* Do: DeleteRepeats DeleteFile InFile > InFileB
* read down InFileB and extract next-garbage to DeleteFile
* Do: DeleteRepeats DeleteFile InFileB > InFile
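
The four steps above could be folded into one loop, sketched here in
Python. This is not EM's awk-code, which isn't shown in this post:
delete_repeats is assumed to drop every line of the input that
exactly matches a line in the delete list, and pick_garbage is a
hypothetical stand-in for the manual "read down and extract
next-garbage" step. Holding the text in memory removes the need to
alternate between InFile and InFileB.

```python
def delete_repeats(delete_lines, in_lines):
    """Return in_lines without any line that matches a (stripped) garbage line."""
    garbage = {line.strip() for line in delete_lines if line.strip()}
    return [line for line in in_lines if line.strip() not in garbage]


def clean(in_lines, pick_garbage):
    """Repeat: pick the next garbage, delete all its repeats, until none is left."""
    current = list(in_lines)
    while True:
        garbage = pick_garbage(current)  # stand-in for the manual extraction step
        if not garbage:
            return current
        cleaned = delete_repeats(garbage, current)
        if len(cleaned) == len(current):
            # Nothing was removed; stop rather than loop forever.
            return current
        current = cleaned


if __name__ == "__main__":
    def first_bracketed(lines):
        # Toy picker: treat the first line starting with "[" as garbage.
        for line in lines:
            if line.startswith("["):
                return [line]
        return []

    text = ["[1] nav", "real text", "[1] nav", "[a] ad", "more text", "[a] ad"]
    print(clean(text, first_bracketed))
```

In an editor, pick_garbage would be the marked region, and each pass
would rewrite the buffer in place.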

== TIA.

PS. Do you work with emacs, and how can an emacs user cope with
3 different syntaxes for <grep>?!
What about integrating the above 4 steps into an emacs-based
editor?
Can emacs easily hook in awk-code?
