Filter is dropping words...
This seems to happen when the word in question spans two "lines". (I
define a line here as 255 characters.) What I'm doing at the moment is
reading one line at a time from the input file and splitting it into
words, which I then check against the database. If a word comes back
as found, I write it to the output line and continue to the next
word. If it isn't found, it goes into the unknowns list, which is
written to a file when the program finishes with the document. What I
think I should be doing is checking whether the word is in the
dictionary and, if it isn't, whether we're at the end of the line. If
we are, store it and move on to the beginning of the next line. When
the first word of that new line isn't in the dictionary either, add it
to the end of the stored word to see if we get a match. I've tried
variations of this a couple of times now, with different but
undesired results.
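
To make the store-and-rejoin idea above concrete, here is a rough sketch
in Python rather than BBC BASIC. It is not the Filter source: the
function name filter_lines, the set-based dictionary and the lowercase
lookup are all assumptions made for illustration. It also assumes a word
never spans more than two lines, and that when the joined word still
isn't found, the stored fragment is simply logged as an unknown.

```python
def filter_lines(lines, dictionary):
    """Return (known, unknowns) for a sequence of fixed-width lines.

    `dictionary` is assumed to be a set of lowercase words (hypothetical
    stand-in for the SmartWord database lookup).
    """
    known, unknowns = [], []
    stored = None  # unknown fragment held over from the end of the previous line

    for line in lines:
        words = line.split()

        if stored is not None:
            # Only try a join when the first word of this line is unknown too.
            if words and words[0].lower() not in dictionary:
                joined = stored + words[0]
                if joined.lower() in dictionary:
                    known.append(joined)
                    words = words[1:]        # fragment consumed by the join
                else:
                    unknowns.append(stored)  # join failed: log the old fragment
            else:
                unknowns.append(stored)
            stored = None

        # An unknown word at the end of the line might be half a word: hold it.
        if words and words[-1].lower() not in dictionary:
            stored = words.pop()

        for word in words:
            if word.lower() in dictionary:
                known.append(word)
            else:
                unknowns.append(word)

    if stored is not None:
        unknowns.append(stored)  # nothing left to join it to
    return known, unknowns


demo_dictionary = {"the", "filter", "drops", "words"}
demo_lines = ["the fil", "ter drops wor", "ds zzz"]  # short "lines" for the demo
demo_known, demo_unknowns = filter_lines(demo_lines, demo_dictionary)
```

One weakness of this approach is that it relies on both halves being
unknown. A split such as "to" + "day" would slip through, because each
half is a valid word on its own.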
AddWords single word logging...
Still haven't got anywhere with this long-standing bug. A full
description of this one is in the !ReadMe file distributed with the
archive.
You can download what I have so far from http://www.garethlock.com/acorn/stdumper/stdump.zip
Done a little optimisation here and there... Managed to shave 2k off
the SmartWord API (Libs.SmartW). Also got the word counts for each
start letter properly right-aligned in the database statistics report
option within SWAdmin.
This update seems to have introduced a few more bugs. For some reason, the
output produced by Filter when SmartWord filtering is turned on loses
all spaces between words. I have no idea why as I haven't made any
changes to this. The main part of the update has been to expand the
dictionary to over 5000 words. Still a long way to go, but the program
is now at a stage where it can scan a document and produce a list of
unknowns from it. This is then manually tidied up and inserted into
the database. Yes... I do put each of the test documents through a
spell-checker BEFORE I stick them through this, so all words are
correctly spelt.
As usual, you can find the latest update at http://www.garethlock.com/acorn/stdumper/stdump.zip
Hopefully by using LibASH blocks, rather than BASIC strings, I can get
around the 255 character limit that's causing words to split between
lines on occasion.
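
The general shape of that fix, independent of LibASH, might look
something like the Python sketch below. It does not use or imitate the
LibASH API. The function name words_from_chunks is made up for
illustration, and whitespace is assumed to be the only word separator.
The point is that a partial word at the end of one chunk is carried over
and glued onto the start of the next, so the chunk size stops mattering.

```python
def words_from_chunks(chunks):
    """Yield whole words from an iterable of text chunks of any size."""
    tail = ""  # partial word left over at the end of the previous chunk
    for chunk in chunks:
        text = tail + chunk
        pieces = text.split()
        # If the text doesn't end in whitespace, its last piece may be
        # incomplete, so hold it back until the next chunk arrives.
        if pieces and not text[-1].isspace():
            tail = pieces.pop()
        else:
            tail = ""
        yield from pieces
    if tail:
        yield tail  # end of input: whatever is left is a complete word


chunk_demo = list(words_from_chunks(["the fil", "ter drops wor", "ds "]))
```

With this arrangement the dictionary lookup only ever sees complete
words, so no rejoin heuristic is needed.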
Anyhow... For those of you that are following, the latest download is
at...