From: Kaz Kylheku
Newsgroups: comp.lang.lisp,comp.emacs
Subject: Re: tricky text manipulation
Followup-To: comp.lang.lisp
Date: Sun, 23 Sep 2012 17:52:23 +0000 (UTC)
Organization: Aioe.org NNTP Server
Message-ID: <20120923103407.343@kylheku.com>

On 2012-09-20, no.top.p...@gmail.com wrote:
> But now, the contents: A,B,C,D will contain common,
> repeated/redundant text, which usually is near the
> page-beginning, but must be handled anywhere is the page.

The problem with your problem description is that it is described in terms
of these abstract hypotheticals. However, the details depend on the actual
data.

You must provide some concrete instances of the actual catenated web pages,
rather than "uA|vB|wC|xD|". Also show the exact output that you want from
each input sample (i.e. each set of catenated web pages).

Since these things are probably large, you should put them on some file
hosting site (perhaps as a compressed archive) and give a URL.

Without these reference input/output pairs, it is impossible to write,
debug, test, and refine a piece of software.

If you keep the exact input and output pairs to yourself (like you did
throughout the entire large thread in the other newsgroups where you posted
this) it will just be another big waste of time.

> So then whith the garbage represented as 8, it looks like:
> u8A|v8B|wC8|x8D|
>
> The algorithm that I see is:
>
> the human starts reading/editing accumulatorFile,
> and notices the/some repeated/redundant/garbage,
> which be pastes out to FileH;

That, right off the bat, is not an algorithm.
An algorithm must describe how garbage is recognized and delimited.

The above part is essentially programming. The human prepares a
specification of what is "redundant garbage".

> and then the program does:

And now the algorithm begins:

> scan accumulatorFile, and delete all copies of the
> text-block:H, except the frst copy.

It's probably much more productive to look for what to *keep*, rather than
what to *delete*. That is to say, take the catenated web pages and scrape
them for interesting content, skipping uninteresting content.

> That should'nt be difficult, but the following refinement
> is also required:
> for matching purposes, ignore all "["d{d}"]" and
> spaces and tabs.

What is this notation "[" d{d} "]" ? Is it BNF?

You can't just drop random notations in the middle of a sentence without
explaining what they are, and expect to be perfectly understood.

> So the following 2 lines should match:
> the [4] cat sat on the mat
> the [27] cat sat on the mat

Ah, of course, d means digit? Oh, stupid me!

Would it kill you to write something like:

  ignore all "[" d { d } "]" where this is EBNF notation, and
  the grammar symbol d stands for a digit?
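
For what it's worth, the matching rule and the "delete all copies except
the first" step can be pinned down in a few lines. What follows is only a
rough Python sketch, not a solution to the real problem, since the real
data was never posted. It assumes that the EBNF reading above is the
intended one (d is an ASCII digit), that the accumulated file is processed
line by line, and that the garbage block H consists of whole lines. The
names match_key and drop_repeated_block are made up for illustration.

```python
import re

# "[" d { d } "]": a bracketed run of one or more ASCII digits, e.g. [4] or [27].
# Assumption: d means an ASCII digit, per the EBNF reading of the notation.
_BRACKETED_NUMBER = re.compile(r"\[[0-9]+\]")
_BLANKS = re.compile(r"[ \t]+")


def match_key(line):
    """Return the line with bracketed numbers, spaces and tabs removed.

    Two lines are considered to match when their keys are equal.
    """
    return _BLANKS.sub("", _BRACKETED_NUMBER.sub("", line))


def drop_repeated_block(lines, block):
    """Return a copy of `lines` with every copy of `block` after the first removed.

    `lines` and `block` are lists of strings (hypothetical representation:
    one string per line of text). Lines are compared through match_key().
    """
    block_keys = [match_key(line) for line in block]
    n = len(block_keys)
    if n == 0:
        return list(lines)

    result = []
    seen_first_copy = False
    i = 0
    while i < len(lines):
        window_keys = [match_key(line) for line in lines[i:i + n]]
        if window_keys == block_keys:
            if not seen_first_copy:
                # Keep the first copy exactly as it appears in the input.
                result.extend(lines[i:i + n])
                seen_first_copy = True
            i += n  # Skip past this copy of the block.
        else:
            result.append(lines[i])
            i += 1
    return result
```

With these definitions, match_key("the [4] cat sat on the mat") and
match_key("the [27] cat sat on the mat") produce the same key, so the two
example lines match. Note that this still leaves the hard part, deciding
what H is, to the human, and it does not handle garbage that starts or ends
in the middle of a line.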