
Can anyone please help me with this one??


Gazza

Sep 14, 2009, 12:58:38 PM
I know this is a big ask for people out there who, like me, are doing
this as a hobby or interest in their spare time. I can't pay anybody
anything and I'm not getting paid for writing this thing. However,
once complete it WILL be freely available to whoever wants a copy of
it.

The idea behind it is that it can take a corrupt wordprocessor/DTP
file and extract the text from it. Beyond that, when complete, it
should be able to tell, with a good degree of accuracy, what is real
text and what is just garbage in the file. It will then try to extract
the body of the text only, so it can be imported back into a fresh
wordprocessor/DTP file for further work.

I've got the basic non-printable/garbage character filter up and
running, with what is a complete set of fully configurable "rules" for
it, which should make altering the program for parsing foreign
language documents just a matter of altering the "rules" to allow
certain accented characters in the upper part of the ASCII set through
the filter.
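
To illustrate the rule-driven character filter described above, here is a
minimal sketch in Python rather than the project's BBC BASIC. The names
DEFAULT_ALLOWED and filter_bytes are hypothetical and not taken from stdump;
the rule format is an assumption (a plain set of permitted byte values).

```python
# Hypothetical rule set: printable ASCII plus newline. Not taken from stdump.
DEFAULT_ALLOWED = set(range(0x20, 0x7F)) | {0x0A}

def filter_bytes(data: bytes, extra_allowed=()) -> bytes:
    """Keep only the bytes permitted by the rules; drop everything else."""
    allowed = DEFAULT_ALLOWED | set(extra_allowed)
    return bytes(b for b in data if b in allowed)

raw = b"\x00\x01Caf\xe9\x02 menu\x7f\n"
plain = filter_bytes(raw)                          # top-bit-set byte dropped
accented = filter_bytes(raw, extra_allowed={0xE9}) # rule added for e-acute (Latin-1)
print(plain, accented)
```

Allowing a foreign-language document through would then only mean extending
the set of permitted byte values, which matches the intent of the "rules" file.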

I have got the functionality of the dictionary add/search back-end
written, including a cache which stores the last 100 new word matches.
This reduces file I/O and therefore speeds up the whole operation of
parsing a text file.
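
A cache of this kind might look something like the sketch below. It is
Python, not the project's BBC BASIC, and WordCache and slow_lookup are
hypothetical names. It also assumes that misses are cached as well as
matches, which may not be how the real back-end behaves.

```python
from collections import OrderedDict

class WordCache:
    """Remembers recent lookups so repeated words skip the slow store."""
    def __init__(self, lookup, size=100):
        self.lookup = lookup            # slow function: word -> bool
        self.size = size
        self.entries = OrderedDict()

    def contains(self, word):
        if word in self.entries:
            self.entries.move_to_end(word)    # mark as recently used
            return self.entries[word]
        found = self.lookup(word)
        self.entries[word] = found
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # evict the oldest entry
        return found

disk_reads = []
def slow_lookup(word):
    disk_reads.append(word)             # stands in for file I/O
    return word in {"the", "cat"}

cache = WordCache(slow_lookup)
results = [cache.contains(w) for w in ["the", "cat", "the", "xyzzy", "the"]]
print(results, len(disk_reads))
```

In this run, five lookups cost only three simulated disk reads, because the
repeated word is answered from the cache.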

I have written a command line tool to add words and word lists to the
dictionary. The functionality of this is fine, but I've included the
option of logging all this activity to a file and this is where I'm
coming unstuck. The program works, in the sense that the word is
checked and, if found in the dictionary, no further action is taken;
otherwise it is added. If given a word list, it'll just chug through
the list file, one word to a line, and add any words that don't already
exist in the dictionary. Unless it is suppressed, this activity should
all be logged either in a file specified by the user on the command
line (it can append to an existing file) or to a central location within
the Resources application that holds the dictionary database and
"rules" file.

It should create either "ListLog" for wordlists or "WordLog" for
adding single words. (NB: There is no alternate log location when adding
single words, although logging can still be suppressed.) It can create
the files without any problem, and it will log the results of the
first wordlist without a hitch, but it doesn't seem to want to append to
the file after that. It doesn't replace it either! As for logging
single word additions, whatever I do, all I get is an empty
file. I've been looking at this for over a week now and getting
nowhere.

What I need is a fresh pair of eyes to go over this code and see if
they can point me to a solution.
Volunteers can find it at http://www.garethlock.com/acorn/stdumper/stdump.zip

Huge thanks to anyone who would take the time to do this.

druck

Sep 14, 2009, 6:10:27 PM
Gazza wrote:
> The idea behind it, is that it can take a corrupt wordprocessor/DTP
> file and extract the text from it.

[big snip]

Look at the RISC OS port of the GNU strings utility.

---druck

Peter Naulls

Sep 14, 2009, 6:17:05 PM
druck wrote:
> Gazza wrote:
>> The idea behind it, is that it can take a corrupt wordprocessor/DTP
>> file and extract the text from it.
>
> [big snip]

Yes, can the OP _please_ learn not to quote sigs.

>
> Look at the RISC OS port of the GNU strings utility.

My kingdom for a URL.

'strings' is part of binutils, which ships with GCC 4:

http://www.riscos.info/downloads/gccsdk/gcc-4.1.1-release-1b/gccsdk-gcc-core-bin-4.1.1-rel1b.zip


Gazza

Sep 14, 2009, 7:32:34 PM
> http://www.riscos.info/downloads/gccsdk/gcc-4.1.1-release-1b/gccsdk-g...

If it includes a built-in dictionary, then it is possible that I'm
re-inventing the wheel here. Otherwise, what I'm working on filters out
the garbage from the file, leaving just human-readable sentences in the
output. In other words, it combines the non-printable/extended ASCII
filter of the GNU 'strings' I'm aware of with an English wordlist to
perform an intelligent comparison against the resulting 'strings' output,
thus passing only true "text" to the output.

Alex Macfarlane Smith

Sep 15, 2009, 3:51:12 AM
Gazza wrote:
> On Sep 14, 11:17 pm, Peter Naulls <pe...@chocky.org> wrote:
[snip]

>>
>> 'strings' is part of binutils, which is with GCC 4:
>>
>> http://www.riscos.info/downloads/gccsdk/gcc-4.1.1-release-1b/gccsdk-g...
>
> If it includes a built-in dictionary, then it is possible that I'm
> re-inventing the wheel here. Otherwise, what I'm working on filters out
> the garbage from the file, leaving just human-readable sentences in the
> output. In other words, it combines the non-printable/extended ASCII
> filter of the GNU 'strings' I'm aware of with an English wordlist to
> perform an intelligent comparison against the resulting 'strings' output,
> thus passing only true "text" to the output.

From the man page:

For each file given, GNU strings prints the printable character
sequences that are at least 4 characters long (or the number given with
the options below) and are followed by an unprintable character. By
default, it only prints the strings from the initialized and loaded
sections of object files; for other types of files, it prints the
strings from the whole file.

strings is mainly useful for determining the contents of non-text
files.
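
The behaviour described in that man page extract can be approximated in a few
lines. This is a rough Python sketch, not the binutils implementation: the
function name strings and the min_len parameter are chosen here for
illustration, it ignores object-file sections entirely, and unlike the
description above it also reports a run that ends at the end of the data.

```python
import re

def strings(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

sample = b"\x00\x01Hello\x00ab\x02World!\xff"
found = strings(sample)
print(found)
```

Note that there is no dictionary step here: any run of printable characters
counts, whether or not it is a real word.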

Alex.

Alan Adams

Sep 15, 2009, 1:40:53 PM
In message <4aaf4770$0$288$1472...@news.sunsite.dk>

> From the man page:

It occurs to me that if it could also handle Unicode, it would be
useful in extracting the text from many Windows documents.

(That would probably be a major extension though...)

Alan


--
Alan Adams, from Northamptonshire
al...@adamshome.org.uk
http://www.nckc.org.uk/

Gazza

Sep 16, 2009, 10:46:38 AM
On Sep 15, 6:40 pm, Alan Adams <a...@adamshome.org.uk> wrote:
> In message <4aaf4770$0$288$14726...@news.sunsite.dk>

As it stands, the filter will quite happily un-munge a dead MS
Word97/2000 format file. After all, that's why I originally wrote it:
to recover text from a corrupt MS Word document that a friend gave
me. She was hitting a college deadline and had to have the assignment
ready to hand in the following day. By the time I finished the coding
and ran the program, all she had left to do was import the text back
into MS Word and add the missing images & diagrams.

Gazza

Sep 21, 2009, 2:49:00 PM
Made some progress with other parts of the project. You can now delete
words as well as add them. The SWord I/O back-end code now has
support for dumping out word lists from the contents of the database
and enumerating the contents of the database, although the options for
dumping word lists haven't been written into the new administration
tool yet. The administration tool can add and delete words and provide
a statistical analysis of the database. However, this is where I'm
hitting another bug. I have one routine in the back-end that counts
the total number of words in the whole database and one that counts
them by letter. If I total up these counts for the 26 letters of the
alphabet, I come up two words short of the figure I get
from the routine that counts the total number of words in the whole
database. These routines are FNsmart_enumdatabase, for the whole
database, and FNsmart_enumwordsbyletter() for counting each letter
separately.

I've been looking at this for a while now and can't see how a
discrepancy could have got in.

I am still looking for a solution to the logging problem in
AddWords. When an open-for-append request is sent to IOLib, it's given
as the second parameter to the FNquick_openfile() call. There are two
valid values for append: 41 (for PRINT#) and 43 (for BPUT#).
After the initial OPENUP call in FNquick_openfile(), the value given
should have 40 subtracted from it before it's put in its slot in the
tracking array. On returning, FNquick_openfile() just sets the
currently active file pointer to the slot of the file just opened and
returns the slot to the caller. From then on, the caller just refers
to the file using the slot provided. Using debugging code added to
AddWords, I've been able to work out that the value is initially reduced
by 40 as it should be, but for some reason later on it comes back as 43
again! What I can't work out is why...

Any help would be appreciated.
The latest download can be found at http://www.garethlock.com/acorn/stdumper/stdump.zip.

Gazza

Sep 22, 2009, 8:04:52 AM
On Sep 21, 7:49 pm, Gazza <use...@garethlock.com> wrote:

> I am still looking for a solution to the logging problem in
> AddWords. When an open-for-append request is sent to IOLib, it's given
> as the second parameter to the FNquick_openfile() call. There are two
> valid values for append: 41 (for PRINT#) and 43 (for BPUT#).
> After the initial OPENUP call in FNquick_openfile(), the value given
> should have 40 subtracted from it before it's put in its slot in the
> tracking array. On returning, FNquick_openfile() just sets the
> currently active file pointer to the slot of the file just opened and
> returns the slot to the caller. From then on, the caller just refers
> to the file using the slot provided. Using debugging code added to
> AddWords, I've been able to work out that the value is initially reduced
> by 40 as it should be, but for some reason later on it comes back as 43
> again! What I can't work out is why...
>
> Any help would be appreciated.
> The latest download can be found at http://www.garethlock.com/acorn/stdumper/stdump.zip.

I have patched in a workaround for the AddWords wordlist logging bug
that has been plaguing me for the last fortnight... The change is in
IOLib's PROCquick_int(), PROCquick_str() and PROCquick_real() routines
and is as follows...

The CASE statement for PROCquick_int() currently reads...

CASE quick_files%(slot%,2) OF
WHEN 0 : INPUT#...
WHEN 1 : PRINT#...
WHEN 2 : st$=GET$#...
WHEN 3 : st$=STR$(int%)...
ENDCASE

The following lines have changed to...

WHEN 1,41 : PRINT#...
WHEN 3,43 : st$=STR$(int%)...

This fixes the problem, but if the rest of the code is working as it
should, there should be no need for this ugly kludge.
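
One way to keep the workaround in a single place, rather than repeating the
extra WHEN values in every routine, would be to normalise the mode through
one helper before dispatching on it. The sketch below is Python, not the
IOLib BBC BASIC; normalise_mode and writer_for are hypothetical names, and
the labels for modes 0 to 3 are read off the CASE statement and the 41/43
description earlier in the thread. It does not explain why the slot value
reverts to 43 in the first place.

```python
APPEND_OFFSET = 40   # 41 and 43 are the append forms of modes 1 and 3

def normalise_mode(mode):
    """Map append modes 41/43 onto base modes 1/3; leave 0-3 unchanged."""
    return mode - APPEND_OFFSET if mode in (41, 43) else mode

def writer_for(mode):
    """Hypothetical dispatch standing in for the CASE statement."""
    handlers = {0: "INPUT#", 1: "PRINT#", 2: "GET$#", 3: "BPUT#"}
    return handlers[normalise_mode(mode)]

print(writer_for(41), writer_for(43))
```

With this arrangement, a stray 41 or 43 in the tracking array still reaches
the right handler, and the dispatch table itself only needs the four base
modes.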

Gazza

Sep 22, 2009, 8:08:59 AM
This WHEN clause fix should be applied to all the routines mentioned
in the previous post...

WHEN 1 :
WHEN 3 :

Should read...

WHEN 1,41 :
WHEN 3,43 :

Gazza

Sep 23, 2009, 11:03:46 AM
On Sep 21, 7:49 pm, Gazza <use...@garethlock.com> wrote:
> However, this is where I'm
> hitting another bug. I have one routine in the back-end that counts
> the total number of words in the whole database and one that counts
> them by letter. If I total up these counts for the 26 letters of the
> alphabet, I seem to be coming up two words short of the figure I get
> from the routine that counts the total number of words in the whole
> database. These routines are FNsmart_enumdatabase, for the whole
> database, and FNsmart_enumwordsbyletter() for counting each letter
> separately.
>
> I've been looking at this for a while now and can't see how a
> discrepancy could have got in.
>
> Any help would be appreciated.

The database discrepancy was down to garbage in the database: a couple of
null strings must have got in there at some point before I hardened
the code to make sure it didn't add junk to the database. In the
process of sorting this one out, I've written the Dump Whole Database
option into the SWAdmin administration tool and rebuilt the database
from scratch using this list. The two totals now match up.
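
For anyone wondering how null strings produce exactly this symptom, here is a
small Python sketch (not the project's BBC BASIC). count_all and
count_by_letter are hypothetical stand-ins for FNsmart_enumdatabase and
FNsmart_enumwordsbyletter(), and the database is assumed to be a simple list
of strings.

```python
import string

def count_all(words):
    """Count every entry, whatever it contains."""
    return len(words)

def count_by_letter(words):
    """Count entries by first letter, A to Z only."""
    counts = {letter: 0 for letter in string.ascii_lowercase}
    for word in words:
        first = word[:1].lower()
        if first in counts:       # an empty entry has no first letter
            counts[first] += 1
    return counts

database = ["apple", "banana", "", "cherry", ""]
total = count_all(database)
by_letter = sum(count_by_letter(database).values())
print(total, by_letter, total - by_letter)
```

The two empty entries are counted in the overall total but fall under none of
the 26 letters, so the per-letter sum comes up two short.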

Gazza

Sep 25, 2009, 6:17:37 PM
Further progress has been made. FNsmart_prepword() is now numeric safe
and has a whole host more symbols added to prevent stray garbage from
getting through to the database. I think I've found out how the null
strings got into the database in the first place and plugged the gap.
(This was down to numeric words such as 200 getting stripped by
FNsmart_prepword() thus returning a null string which somehow got past
the null string check. Don't know how though, as the null string check
was performed on words AFTER they had been passed through
FNsmart_prepword()!!)
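
The intended order of operations (clean the word first, then test for a null
string) can be sketched as follows. This is Python rather than BBC BASIC;
prep_word is only a hypothetical stand-in for FNsmart_prepword(), assumed
here to keep letters only, and the database is assumed to be a set.

```python
def prep_word(raw):
    """Hypothetical stand-in for FNsmart_prepword(): letters only, lower-cased."""
    return "".join(ch for ch in raw if ch.isalpha()).lower()

def add_word(database, raw):
    """Add the cleaned word; return True only if something was added."""
    word = prep_word(raw)
    if not word:              # check the cleaned word, not the raw input
        return False
    if word in database:
        return False
    database.add(word)
    return True

db = set()
outcomes = [add_word(db, w) for w in ["200", "Hello!", "hello", "3rd"]]
print(outcomes, sorted(db))
```

A purely numeric input such as 200 cleans down to an empty string and is
rejected before it can reach the database.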

The dump routines have been coded into the database administration
tool. This now leaves just two features to write into this tool:
Index and Verify.

Other progress... I've now got hold of a dictionary and I am starting
to trawl through it entering words into the database, so in about a
week or so, I should have a database with enough words in it to begin
to code and trial the second stage filtering. This is the part that
actually takes the text output from the character filter and extracts
English sentences from the other displayable junk that crept through
the first stage.
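
As a rough idea of what a second stage like this could do, the Python sketch
below keeps a line only if enough of its words are in the dictionary. The
function second_stage, the 0.5 threshold and the line-by-line approach are
all assumptions made for illustration; the real filter is written in BBC
BASIC and may work quite differently.

```python
import re

def second_stage(lines, dictionary, threshold=0.5):
    """Keep lines where at least `threshold` of the words are known."""
    kept = []
    for line in lines:
        words = re.findall(r"[A-Za-z]+", line)
        if not words:
            continue                      # nothing word-like on this line
        known = sum(1 for w in words if w.lower() in dictionary)
        if known / len(words) >= threshold:
            kept.append(line)
    return kept

dictionary = {"the", "patterns", "are", "clear"}
lines = ["Patterns are clear", "xQz jjk Wp", "the zzq kk", "\x7f@@##"]
kept = second_stage(lines, dictionary)
print(kept)
```

A ratio test like this tolerates the odd unknown word in a real sentence
while still rejecting lines that are mostly displayable junk.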

No further progress has been made on the WIMP version yet, but that's
been put on the back burner on purpose. I've got enough to do just
trying to get the command line toolset to work before I concentrate on
the added complexities of a WIMP front-end. However, I've coded the
guts of the thing in such a way that it should be a straight drop in
as and when I can get a WIMP front-end together. If someone else
wishes to do a front-end for it, then I can of course provide the code
that deals with the back-end database I/O along with its dependencies
and documentation on the database I/O back-end interface.

Any suggestions welcome.

Gazza

Oct 1, 2009, 12:50:45 PM
Progress has been made on adding a preliminary SmartWord filter to
the Filter tool. It mostly works, but there are a couple of issues with
it at the moment. In a test document I'm using, the word "patterns" in
the original document for some reason gets mangled to "atterns" in the
unknown words list dump, and therefore gets culled from the final output
when the cull "-c" option is enabled. This seems to happen
mostly when it's the first word on the input line, though not always.
Also, it's not catching all the junk and some still gets through,
although it is a good deal less than with the character filter alone.
I need to look into this further.

The current version is available at http://www.garethlock.com/acorn/stdumper/stdump.zip.
The dictionary database is still FAR from complete and that is the
task I'm currently working on.
