The idea behind it is that it can take a corrupt wordprocessor/DTP
file and extract the text from it. Beyond that, when complete, it
should be able to tell, with reasonable accuracy, what is real text
and what is just printable garbage in the file. It will then try to
extract the body of the text only, so it can be imported back into a
fresh wordprocessor/DTP file for further work.
I've got the basic non-printable/garbage character filter up and
running, with a complete set of fully configurable "rules" for it.
That should make adapting the program to parse foreign language
documents just a matter of altering the "rules" to allow certain
accented characters in the upper part of the ASCII set (codes above
127) through the filter.
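To make the "rules" idea concrete, here is a minimal sketch in Python rather than the BBC BASIC the program is written in. It assumes a rule set is simply the set of byte values allowed through; the names make_rules and filter_bytes are hypothetical and are not taken from the actual program.

```python
import string

# Assumption: the default rules allow printable ASCII only.
BASE_RULES = set(string.printable.encode("ascii"))

def make_rules(extra_bytes=b""):
    """Return the set of byte values the filter lets through."""
    return BASE_RULES | set(extra_bytes)

def filter_bytes(data, rules):
    """Keep allowed bytes; collapse each run of rejected bytes into one newline."""
    out = bytearray()
    for b in data:
        if b in rules:
            out.append(b)
        elif out and out[-1] != 0x0A:
            out.append(0x0A)
    return bytes(out)

sample = b"\x00\x01Hello\x02\x03caf\xe9\x00"
print(filter_bytes(sample, make_rules()))         # accented byte rejected
print(filter_bytes(sample, make_rules(b"\xe9")))  # accented byte allowed
```

Allowing a foreign-language character is then a change to the rule set only, not to the filter loop.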
I have got the functionality of the dictionary add/search back-end
written, including a cache which stores the last 100 new word matches.
This has the effect of reducing file I/O and therefore speeding up the
whole operation of parsing a text file.
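A cache of this kind could look roughly like the following Python sketch. It assumes the slow dictionary search is passed in as a function and that only successful matches are cached; the class name WordCache and the slow_lookups counter are hypothetical and only there for illustration.

```python
from collections import OrderedDict

class WordCache:
    """Remember the most recent matches so repeats skip the dictionary file."""

    def __init__(self, lookup, capacity=100):
        self.lookup = lookup          # slow search, e.g. one that reads the file
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest match first
        self.slow_lookups = 0         # how often the slow search was needed

    def contains(self, word):
        key = word.lower()
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as recently used
            return True
        self.slow_lookups += 1
        found = self.lookup(key)
        if found:
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # drop the oldest match
        return found
```

Common words such as "the" recur constantly in body text, so even a small cache avoids most of the repeated file searches.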
I have written a command line tool to add words and word lists to the
dictionary. The functionality of this is fine, but I've included the
option of logging all this activity to a file and this is where I'm
coming unstuck. The program works, in the sense that the word is
checked and, if found in the dictionary, no further action is taken;
otherwise it is added. If given a word list, it'll just chug through
the list file, one word to a line, and add any words that don't
already exist in the dictionary. Unless it is suppressed, this
activity should all be logged, either in a file specified by the user
on the command line (it can append to an existing file) or in a
central location within the Resources application that holds the
dictionary database and "rules" file.
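For comparison, the intended behaviour (check each word, add it if missing, append the outcome to a log unless logging is suppressed) can be sketched in a few lines of Python. This is not the AddWords code: the function name add_words, the in-memory set standing in for the dictionary database and the log line format are all assumptions.

```python
def add_words(words, dictionary, log_path=None):
    """Add unseen words to `dictionary` (a set); log each outcome if asked.

    Passing log_path=None stands in for suppressing the log.
    """
    added = []
    # Mode "a" appends to an existing log and creates it when missing.
    log = open(log_path, "a", encoding="utf-8") if log_path else None
    try:
        for raw in words:
            word = raw.strip()        # one word to a line; drop the newline
            if not word:
                continue
            if word in dictionary:
                status = "exists"
            else:
                dictionary.add(word)
                added.append(word)
                status = "added"
            if log:
                log.write(f"{word}: {status}\n")
    finally:
        if log:
            log.close()               # make sure buffered lines reach the file
    return added
```

Running it twice against the same log path should leave the lines from both runs in the file, which is the behaviour the real tool is meant to have.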
It should create either "ListLog" for wordlists or "WordLog" for
adding single words. (NB: There is no alternate location for adding
single words, although logging can still be suppressed.) It can create
the files without any problem, and it will log the results of the
first wordlist without a hitch, but it doesn't seem to want to append
to the file after that. It doesn't replace it either! As for logging
single word additions, whatever I do, all I'm getting is an empty
file. I've been looking at this for over a week now and getting
nowhere.
What I need is a fresh pair of eyes to go over this code and see if
they can point me to a solution.
Volunteers can find it at http://www.garethlock.com/acorn/stdumper/stdump.zip
Huge thanks to anyone who would take the time to do this.
[big snip]
Look at the RISC OS port of the GNU strings utility.
---druck
Yes, can the OP _please_ learn not to quote sigs.
>
> Look at the RISC OS port of the GNU strings utility.
My kingdom for a URL.
'strings' is part of binutils, which is with GCC 4:
http://www.riscos.info/downloads/gccsdk/gcc-4.1.1-release-1b/gccsdk-gcc-core-bin-4.1.1-rel1b.zip
If it includes a built-in dictionary, then it is possible that I'm
re-inventing the wheel here. Otherwise, what I'm working on filters
out the garbage from the file, leaving just human-readable sentences
in the output. In other words, it combines the non-printable/extended
ASCII filter of the GNU 'strings' I'm aware of with an English
wordlist to perform an intelligent comparison against the resulting
'strings' output, so that only true "text" is passed to the output.
From the man page:
For each file given, GNU strings prints the printable character
sequences that are at least 4 characters long (or the number given with
the options below) and are followed by an unprintable character. By
default, it only prints the strings from the initialized and loaded
sections of object files; for other types of files, it prints the
strings from the whole file.
strings is mainly useful for determining the contents of non-text
files.
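The core behaviour described in that man page extract can be approximated in a few lines of Python. This is only a sketch of the idea, not how GNU strings is implemented: it treats bytes 0x20 to 0x7E plus tab as printable, does not require a trailing unprintable character, and ignores object file sections. The function name printable_runs is hypothetical.

```python
import re

def printable_runs(data, min_len=4):
    """Return runs of printable ASCII in `data` that are at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

blob = b"\x00abc\x01Hello, world\x00\x02ok\x03text\x00"
print(printable_runs(blob))
```

Note that nothing here knows whether a run is a real word, which is the gap the dictionary-based second stage is meant to fill.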
Alex.
> From the man page:
It occurs to me that if it could also handle Unicode, it would be
useful in extracting the text from many Windows documents.
(That would probably be a major extension though...)
Alan
--
Alan Adams, from Northamptonshire
al...@adamshome.org.uk
http://www.nckc.org.uk/
As it stands, the filter will quite happily un-munge a dead MS
Word97/2000 format file. After all, that's why I originally wrote it:
to recover text from a corrupt MS Word document that a friend gave
me. She was hitting a college deadline and had to have the assignment
ready to hand in the following day. By the time I finished the coding
and ran the program, all she had left to do was import the text back
into MS Word and add the missing images & diagrams.
I've been looking at this for a while now and can't see how a
discrepancy could have got in.
I am still looking for a solution to the logging problem in
AddWords. When an open-for-append request is sent to IOLib, it's given
as the second parameter to the FNquick_openfile() call. There are two
valid values for append: 41 (for PRINT#) and 43 (for BPUT#).
After the initial OPENUP call in FNquick_openfile(), the value given
should have 40 subtracted from it before it's put in its slot in the
tracking array. On returning, FNquick_openfile() just sets the
currently active file pointer to the slot of the file just opened and
returns the slot to the caller. From then on, the caller just refers
to the file using the slot provided. Using debugging code added to
AddWords, I've been able to work out that the value is initially
reduced by 40 as it should be, but for some reason later on it comes
back as 43 again! What I can't work out is why...
Any help would be appreciated.
The latest download can be found at http://www.garethlock.com/acorn/stdumper/stdump.zip.
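For anyone following along without the source, the scheme described above can be modelled in Python roughly as follows. This is a sketch of the description only, not IOLib itself: the FileTable class, the tuple layout of the tracking array and the write_line and close helpers are assumptions, and modes 0 to 3 are taken from the read and write modes mentioned in this thread.

```python
APPEND_FLAG = 40        # 41 and 43 are modes 1 and 3 plus this offset
READ_MODES = {0, 2}     # the INPUT# and GET$# style modes
WRITE_MODES = {1, 3}    # the PRINT# and BPUT# style modes

class FileTable:
    """Hypothetical stand-in for the tracking array of open files."""

    def __init__(self):
        self.slots = []     # one (handle, mode) pair per open file
        self.active = None  # slot of the most recently opened file

    def quick_openfile(self, path, mode):
        append = (mode - APPEND_FLAG) in WRITE_MODES
        if append:
            mode -= APPEND_FLAG   # normalise before the mode is stored
        if mode in READ_MODES:
            handle = open(path, "r")
        elif mode in WRITE_MODES:
            handle = open(path, "a" if append else "w")
        else:
            raise ValueError(f"unknown mode {mode}")
        self.slots.append((handle, mode))
        self.active = len(self.slots) - 1
        return self.active

    def write_line(self, slot, text):
        handle, _mode = self.slots[slot]
        handle.write(text + "\n")

    def close(self, slot):
        self.slots[slot][0].close()
```

In this model the stored mode can only ever be 0 to 3, so seeing 43 come back later would mean something wrote to the slot after the open call, or the value was read from a different slot or variable than the one that was normalised.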
I have patched in a workaround for the AddWords wordlist logging bug
that has been plaguing me for the last fortnight. The change is in
IOLib's PROCquick_int(), PROCquick_str() and PROCquick_real() routines
and is as follows...
The CASE statement for PROCquick_int() currently reads...
CASE quick_files%(slot%,2) OF
WHEN 0 : INPUT#...
WHEN 1 : PRINT#...
WHEN 2 : st$=GET$#...
WHEN 3 : st$=STR$(int%)...
ENDCASE
The following lines have changed to...
WHEN 1,41 : PRINT#...
WHEN 3,43 : st$=STR$(int%)...
This fixes the problem, but if the rest of the code is working as it
should, there should be no need for this ugly kludge.
WHEN 1 :
WHEN 3 :
Should read...
WHEN 1,41 :
WHEN 3,43 :
The database discrepancy was down to garbage in the database. A couple
of null strings must have got in there at some point before I hardened
the code to make sure it didn't add junk to the database. In the
process of sorting this one out, I've written the Dump Whole Database
option into the SWAdmin administration tool and re-built the database
from scratch using this list. The two totals now match up.
The dump routines have been coded into the database administration
tool. This now leaves just two features to write into this tool:
Index & Verify.
Other progress... I've now got hold of a dictionary and I am starting
to trawl through it, entering words into the database, so in about a
week or so I should have a database with enough words in it to begin
to code and trial the second stage filtering. This is the part that
actually takes the text output from the character filter and extracts
English sentences from the other displayable junk that crept through
the first stage.
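One way the second stage might work, sketched in Python: score each line from the character filter by the fraction of its words found in the dictionary, and keep only lines above a cut-off. The scoring rule, the 0.6 threshold and the function names are assumptions for illustration; the actual second stage has not been written yet.

```python
import re

def looks_like_text(line, dictionary, threshold=0.6):
    """True if enough of the line's alphabetic tokens are dictionary words."""
    tokens = re.findall(r"[A-Za-z']+", line)
    if not tokens:
        return False
    known = sum(1 for t in tokens if t.lower() in dictionary)
    return known / len(tokens) >= threshold

def second_stage(lines, dictionary, threshold=0.6):
    """Keep only the lines that pass the dictionary check."""
    return [ln for ln in lines if looks_like_text(ln, dictionary, threshold)]

words = {"the", "cat", "sat", "on", "mat"}
candidates = ["The cat sat on the mat.", "xQ9 zzkp ]]w", "qwe the rty"]
print(second_stage(candidates, words))
```

A ratio rather than an all-or-nothing test lets a sentence through even when it contains a name or a word missing from a still incomplete dictionary.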
No further progress has been made on the WIMP version yet, but that's
been put on the back burner on purpose. I've got enough to do just
trying to get the command line toolset to work before I concentrate on
the added complexities of a WIMP front-end. However, I've coded the
guts of the thing in such a way that it should be a straight drop-in
as and when I can get a WIMP front-end together. If someone else
wishes to do a front-end for it, then I can of course provide the code
that deals with the back-end database I/O, along with its dependencies
and documentation on the database I/O back-end interface.
Any suggestions welcome.
The current version is available at http://www.garethlock.com/acorn/stdumper/stdump.zip.
The dictionary database is still FAR from complete and that is the
task I'm currently working on.