I have uploaded the source for this project to
borland.public.attachments, with the Subject: Bayesian Classifier
in Delphi (Source).
For those who like to live dangerously, I can e-mail a compiled
version of this program. borland.public.attachments does not seem
to let me upload it there, even if I Zip it. Maybe I am doing
something incorrectly.
The program uses Naive Bayesian Classification to estimate, for
each of several possible categories, the probability that a given
document belongs to it. Two typical categories might be: 'Spam' and
'Good'. However, the program can accept as many categories as
you would like: 'Spam', 'Work', 'Personal', 'Hobby', etc.
This exercise on my part was not aimed at developing a finished
product. I was only trying to see what kind of simple data
structures and algorithms could be constructed that were
sufficiently discriminating between different categories of
documents.
I explored Bayesian Classification not to spot Spam but as a way
to classify documents into categories. Therefore, there are no
facilities to handle POP3 and the like in this program. The
program just reads a file and classifies it. Nothing fancy... or
terribly useful!
Objects.
TBayesBase = class(TObject)
This is the base object. It contains a StringList that is used
to hold the tokens/words of a document to classify or the
tokens/words in a Category Database.
TBayesDB = class(TBayesBase)
This is a Category Database. One of these is created for each
category, 'Spam', 'Good', etc. It contains the routines to
manage the token database. It also contains routines to compute
token percentages and contains fields used during the
classification process.
TBayesDoc = class(TBayesBase)
This contains the document to be classified as well as the
parsing and Bayes Classification routines.
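As an illustration of the kind of data structure described above, here is a minimal sketch (not the code from the upload) of a StringList used as a token table, with a per-token count from which token percentages are computed. TTokenInfo and AddToken are hypothetical names, and keeping the count in the Objects slot is an assumption; the real TBayesBase and TBayesDB may store things differently.

```pascal
program TokenTableSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  { Hypothetical per-token record kept in TStringList.Objects. }
  TTokenInfo = class
    Count: Integer;
  end;

{ Record one occurrence of Token, creating its entry on first sight. }
procedure AddToken(Tokens: TStringList; const Token: string);
var
  Idx: Integer;
  Info: TTokenInfo;
begin
  Idx := Tokens.IndexOf(Token);
  if Idx < 0 then
  begin
    Info := TTokenInfo.Create;
    Info.Count := 0;
    Idx := Tokens.AddObject(Token, Info);
  end;
  Info := TTokenInfo(Tokens.Objects[Idx]);
  Inc(Info.Count);
end;

var
  Tokens: TStringList;
  Info: TTokenInfo;
  Total, I: Integer;
begin
  Tokens := TStringList.Create;
  try
    Tokens.Sorted := True;  { sorted list, so IndexOf is a binary search }
    AddToken(Tokens, 'free');
    AddToken(Tokens, 'money');
    AddToken(Tokens, 'free');
    AddToken(Tokens, 'offer');

    { Total number of token occurrences in this table. }
    Total := 0;
    for I := 0 to Tokens.Count - 1 do
      Inc(Total, TTokenInfo(Tokens.Objects[I]).Count);

    { Each token's share of the total, as a percentage. }
    for I := 0 to Tokens.Count - 1 do
    begin
      Info := TTokenInfo(Tokens.Objects[I]);
      WriteLn(Format('%-10s %3d  %5.1f%%',
        [Tokens[I], Info.Count, 100.0 * Info.Count / Total]));
    end;
  finally
    { The list does not own its objects, so free them by hand. }
    for I := 0 to Tokens.Count - 1 do
      Tokens.Objects[I].Free;
    Tokens.Free;
  end;
end.
```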
To put together a program to do Bayesian Classification, create
one instance of a TBayesDoc. Its Parse and Classify routines can
be called repeatedly with different file names.
Also, create a TBayesDB for each category of interest.
Included in the download is a test program (Bayesian) that uses
the above Objects to classify documents.
The program was developed in Delphi 3 (Hey, it was handy!) and
uses routines from TurboPower's SysTools
(http://sourceforge.net/projects/tpsystools/) and from HyperStr
(http://www.mindspring.com/~efd/hyperstr.htm). The program uses
these packages only lightly, mainly for file name and string
manipulation, so they could be replaced without too much trouble.
The biggest headache might be the TreeView I use in the Demo
program to show the databases and classification results. It is
from Woll2Woll Software's 1stClass (http://www.woll2woll.com/).
However, I am using it as just a basic TreeView with checkboxes.
The checkboxes are used to take a Database out of consideration
when doing classification. Each node in the TreeView holds a
pointer to a TBayesDB Object.
Things You Might Want to Change.
The Parser (TBayesDoc.Parse). I only wanted to consider words
(tokens) composed of alphabetic characters. Also, I did not want
the parser to have to know anything about the structure or layout
of the document being classified, e-mail included. To the parser,
a document was just a text file. For example, the "Subject" line
was no more interesting or
special than any other part of the e-mail. I did not want to
have special handling for HTML, or anything else. However, there
is one exception: Base64. I chose to "kind of" handle Base64
strings by spotting them and then throwing them away!
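A minimal sketch of a parser along these lines: it keeps only runs of alphabetic characters as tokens and skips lines that look like Base64. This is not TBayesDoc.Parse itself. LooksLikeBase64 and ParseLine are hypothetical names, and the test used here (a line of at least 40 characters drawn only from the Base64 alphabet) is an assumed heuristic, since the post does not say how Base64 is spotted.

```pascal
program ParseSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

{ Assumed heuristic: a long line made only of Base64 alphabet characters. }
function LooksLikeBase64(const Line: string): Boolean;
var
  I: Integer;
begin
  Result := Length(Line) >= 40;
  if not Result then
    Exit;
  for I := 1 to Length(Line) do
    if not (Line[I] in ['A'..'Z', 'a'..'z', '0'..'9', '+', '/', '=']) then
    begin
      Result := False;
      Exit;
    end;
end;

{ Append the lower-cased alphabetic words of Line to Tokens. }
procedure ParseLine(const Line: string; Tokens: TStrings);
var
  I, Start: Integer;
begin
  if LooksLikeBase64(Line) then
    Exit;  { throw the whole line away }
  Start := 0;
  { Run one position past the end so a trailing word is flushed. }
  for I := 1 to Length(Line) + 1 do
    if (I <= Length(Line)) and (Line[I] in ['A'..'Z', 'a'..'z']) then
    begin
      if Start = 0 then
        Start := I;  { a new word begins here }
    end
    else if Start > 0 then
    begin
      Tokens.Add(LowerCase(Copy(Line, Start, I - Start)));
      Start := 0;
    end;
end;

var
  Tokens: TStringList;
  I: Integer;
begin
  Tokens := TStringList.Create;
  try
    ParseLine('Subject: Buy cheap meds now!!!', Tokens);
    ParseLine('U29tZSBiYXNlNjQgZW5jb2RlZCBhdHRhY2htZW50IGRhdGE=', Tokens);
    ParseLine('<b>Call 555-1234 today</b>', Tokens);
    for I := 0 to Tokens.Count - 1 do
      WriteLn(Tokens[I]);
  finally
    Tokens.Free;
  end;
end.
```

Note that the HTML tag name "b" comes through as an ordinary token and the phone number is dropped, which matches the "no special handling" approach described above.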
Token Database Management (TBayesDB.PruneDB). A Database can
fill up with junk if you are processing e-mail. Spammers have
started using nonsense words and random letter groupings to
pollute Bayesian Spam filters. It doesn't work, but it does make
getting rid of them a chore.
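One simple way to prune, shown here only as a sketch, is to drop every token that has been seen fewer than some minimum number of times. This is not necessarily what TBayesDB.PruneDB does. TTokenInfo, AddToken, Prune and the MinCount threshold are all hypothetical, as is storing the count in the Objects slot of the StringList.

```pascal
program PruneSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  { Hypothetical per-token record kept in TStringList.Objects. }
  TTokenInfo = class
    Count: Integer;
  end;

{ Add a token with a given count (test data helper). }
procedure AddToken(Tokens: TStringList; const Token: string; Count: Integer);
var
  Info: TTokenInfo;
begin
  Info := TTokenInfo.Create;
  Info.Count := Count;
  Tokens.AddObject(Token, Info);
end;

{ Delete every token seen fewer than MinCount times.
  Returns the number of tokens removed. }
function Prune(Tokens: TStringList; MinCount: Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  { Walk backwards so Delete does not shift entries still to be visited. }
  for I := Tokens.Count - 1 downto 0 do
    if TTokenInfo(Tokens.Objects[I]).Count < MinCount then
    begin
      Tokens.Objects[I].Free;
      Tokens.Delete(I);
      Inc(Result);
    end;
end;

var
  Tokens: TStringList;
  Removed, I: Integer;
begin
  Tokens := TStringList.Create;
  try
    AddToken(Tokens, 'offer', 42);
    AddToken(Tokens, 'xqzvty', 1);
    AddToken(Tokens, 'meeting', 17);
    AddToken(Tokens, 'jkhwqe', 1);

    Removed := Prune(Tokens, 2);
    WriteLn(Removed, ' tokens pruned, ', Tokens.Count, ' kept');
  finally
    for I := 0 to Tokens.Count - 1 do
      Tokens.Objects[I].Free;
    Tokens.Free;
  end;
end.
```

A count threshold is crude: it also removes legitimate rare words, so in practice it might be combined with how recently a token was last seen.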
Classification (TBayesDoc.Classify).
I do not have a deep understanding of Bayesian statistics :-) and
the many ways it might be used to accomplish what I wanted to do.
Lots of improvements could be made here. There are many
different ways to do this, most of which I don't fully understand!
I picked up useful ideas from the following sites:
http://popfile.sourceforge.net/
http://spambayes.sourceforge.net/
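For reference, a minimal sketch of textbook Naive Bayes scoring over several categories. It is not TBayesDoc.Classify, and it makes several assumptions: equal prior probability for every category, add-one (Laplace) smoothing for unseen tokens, and log-probabilities to avoid underflow on long documents. The toy training data and all names (CountOf, CatNames, and so on) are invented for the example.

```pascal
program ClassifySketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  NumCats = 2;
  CatNames: array[0..NumCats - 1] of string = ('Spam', 'Good');

{ Number of times Token occurs among a category's training tokens. }
function CountOf(DB: TStrings; const Token: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to DB.Count - 1 do
    if DB[I] = Token then
      Inc(Result);
end;

var
  DBs: array[0..NumCats - 1] of TStringList;
  Vocab, Doc: TStringList;
  Score: array[0..NumCats - 1] of Double;
  MaxScore, Sum: Double;
  C, I: Integer;
begin
  Vocab := TStringList.Create;
  Doc := TStringList.Create;
  for C := 0 to NumCats - 1 do
    DBs[C] := TStringList.Create;
  try
    { Toy training data: one entry per token occurrence. }
    DBs[0].CommaText := 'free,money,free,offer,click';
    DBs[1].CommaText := 'meeting,project,report,free,lunch';
    Doc.CommaText := 'free,offer,meeting';

    { Vocabulary = distinct tokens over all categories. }
    Vocab.Sorted := True;
    Vocab.Duplicates := dupIgnore;
    for C := 0 to NumCats - 1 do
      Vocab.AddStrings(DBs[C]);

    { Log-likelihood of the document under each category, with
      add-one smoothing so an unseen token does not zero the product. }
    for C := 0 to NumCats - 1 do
    begin
      Score[C] := 0.0;
      for I := 0 to Doc.Count - 1 do
        Score[C] := Score[C] +
          Ln((CountOf(DBs[C], Doc[I]) + 1) / (DBs[C].Count + Vocab.Count));
    end;

    { Turn log scores into probabilities that sum to 1. Subtracting
      the maximum first keeps Exp from underflowing. }
    MaxScore := Score[0];
    for C := 1 to NumCats - 1 do
      if Score[C] > MaxScore then
        MaxScore := Score[C];
    Sum := 0.0;
    for C := 0 to NumCats - 1 do
    begin
      Score[C] := Exp(Score[C] - MaxScore);
      Sum := Sum + Score[C];
    end;

    for C := 0 to NumCats - 1 do
      WriteLn(Format('%-5s %5.1f%%', [CatNames[C], 100.0 * Score[C] / Sum]));
  finally
    for C := 0 to NumCats - 1 do
      DBs[C].Free;
    Doc.Free;
    Vocab.Free;
  end;
end.
```

With this toy data the smoothed likelihoods are in the ratio 6:4, so the document comes out as roughly 60% 'Spam' and 40% 'Good'. Adding a third category only means enlarging NumCats and CatNames and supplying its training tokens.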
How to use the program.
The program does not have to be installed. Just put it somewhere
and run it.
1. Create some databases. In the "New Database" section type a
Name and click on "Create".
2. Load the Databases. In the "Batch Load" section choose a
directory containing text files of a specific type/category.
Highlight a database and then click "Load". Or, you can load
files individually. See next step.
3. In the "Documents" section choose a text file. Then click on
"Parse". At this point you can highlight a database and click on
"Add Document To". The words/tokens in the document will then
be added to the selected Database.
4. Once you have a sufficient number of documents loaded into
the various Databases (i.e., the program is "trained"), select a
new document and parse it. This time click on "Classify". The
list of Databases will now reflect the probability of that
document belonging to each of the database categories.
5. If, in the above step, the program misclassified the
document, or the document was not classified with a high enough
percentage to suit you, then highlight the database the document
should belong to and click on "Reclassify Document As".
6. Prune. Clean out the junk tokens/words. There are better
ways to do this. You don't really have to do this at all.
- David Harper dha...@houston.rr.com