
Non-email use of the spambayes project

Skip Montanaro

Mar 27, 2003, 10:10:55 PM

I've successfully applied the Spambayes code (http://spambayes.sf.net/) to a
non-email application today and thought I'd pass the concept along to
others. Many of you on c.l.py are probably aware of the Spambayes project,
which relies on the user segregating a set of email messages into spam and
ham, then combines the clues those messages contain to predict the hamminess
or spamminess of email messages it hasn't seen before. It works extremely
well for this, but the basic concept is applicable to other classification
problems.

I've operated the Mojam and Musi-Cal websites for several years. Over that
time we've accumulated a sizable venue database. Unfortunately, many
entries in the database have become stale and don't contribute anything to
the system other than to slow down queries. Venue names get misspelled,
venues go out of business, non-venue stuff slips into the database, or other
errors occur. As a result, I had a venue database containing roughly 35,000
entries, only about half of which were referenced by concert items in the
database. The database as it sat couldn't be licensed to potential
customers because of all the errors it contained. I could simply delete all
of those entries, but that would delete a lot of useful content from the
database. Many of those currently unreferenced venue entries *are* correct
and will eventually be associated with other concerts, or will be useful as
ancillary information for people using our websites or as an extra database
we can license to content consumers.

I wrote a trivial little application today which allowed me to rummage
through the unreferenced records in the database. I could delete entries
which I felt were incorrect, but it was a one-at-a-time process. With
15,000+ entries to scan, one-by-one wasn't going to cut it.
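For illustration, here's roughly how the unreferenced records might be
pulled out as dictionaries, which is the form the tokenizer below expects.
The table and column names are hypothetical - the actual Mojam schema isn't
shown in this post:

import MySQLdb
import MySQLdb.cursors

# DictCursor makes fetchall() return each row as a dictionary
conn = MySQLdb.connect(db="mojam",
                       cursorclass=MySQLdb.cursors.DictCursor)
cur = conn.cursor()
# hypothetical schema: venue rows no concert row refers to
cur.execute("""SELECT v.*
                 FROM venue v
                      LEFT JOIN concert c ON c.venue = v.id
                WHERE c.venue IS NULL""")
records = cur.fetchall()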

Then I got the idea to use the Spambayes classifier to watch what I was
doing and train on my actions. I was viewing the records in chunks of 20
items at a time, sorted alphabetically. I could choose to delete one or
more items or move on to the next chunk of 20 entries. A deletion caused the
classifier to be trained on the entry as "spam". Moving on to the next chunk
caused the classifier to be trained on the remaining undeleted entries as
"ham". Over a short period of time, it got reasonably good at identifying
"spam". I then started sorting each chunk of 20 items by its spambayes
score and could specify a threshold score below which to eliminate all
entries in that chunk.
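
In code, that review loop amounts to something like the following sketch,
where classifier is an instance of the Classifier class shown below.
fetch_chunks(), review() and delete_record() are hypothetical stand-ins for
the UI and database plumbing, which isn't shown here:

for chunk in fetch_chunks(records, 20):   # hypothetical: groups of 20 records
    doomed = review(chunk)                # hypothetical: entries I chose to delete
    for d in doomed:
        classifier.train(d, False)        # deleted entry
        delete_record(d)                  # hypothetical database delete
    for d in chunk:
        if d not in doomed:
            classifier.train(d, True)     # kept entry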

The next improvement was to sort the entire mess of records by the spambayes
classification. I was then seeing entire chunks of records whose scores
fell below the threshold and was able to delete them 20 at a time.
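
With the Classifier class below, that global sort is only a few lines. Note
that classify() returns a (probability, clues) tuple because it passes
evidence=True to spamprob(), and that as the code is written the kept
records are the high scorers, so the deletable entries sort to the front. A
sketch, with a made-up threshold:

# decorate-sort-undecorate: lowest-scoring (most deletable) records first
scored = [(classifier.classify(d)[0], d) for d in records]
scored.sort()

threshold = 0.1    # hypothetical cutoff, tuned by eye
doomed = [d for prob, d in scored if prob < threshold]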

All the Spambayes-specific code in my application is a single tokenizer
generator function and a small Classifier class:

import math
import string

import spambayes.storage

# identity translation table - with string.translate() below it lets us
# delete punctuation characters from a string
null_xlate = string.maketrans("", "")

class Classifier:
    def __init__(self):
        self.cls = spambayes.storage.DBDictClassifier("fven.db")

    def classify(self, d):
        # evidence=True makes spamprob() return a (probability, clues)
        # tuple rather than a bare probability
        return self.cls.spamprob(tokenize(d), True)

    def train(self, d, saved):
        # note: records I keep (saved=True) are passed to learn() as its
        # "spam" class, so deletable records wind up with *low* scores -
        # hence eliminating everything below a threshold
        self.cls.learn(tokenize(d), saved)

    def __del__(self):
        self.cls.store()

def tokenize(d):
    # d is a dictionary as returned by a MySQL query - tokenize the
    # various fields, noting interesting facts
    yield "venue length:%d" % len(d["venue"])
    for word in d["venue"].split():
        # looks like a festival - not a venue at all
        if word.lower().endswith("fest"):
            yield "venue:<fest>"
        yield "venue:"+word
    # most correct venue names don't contain punctuation
    if (string.translate(d["venue"], null_xlate, string.punctuation)
            != d["venue"]):
        yield "venue:<punctuation>"
    # no address information for this venue - less valuable
    if not d["addr1"]:
        yield "addr1:<empty>"
    elif d["addr1"][0] not in string.digits:
        # most valid addresses in the US/Canada begin with a street number
        yield "addr1:<no number>"
    for word in d["addr1"].split():
        yield "addr1:"+word
    for word in d["addr2"].split():
        yield "addr2:"+word
    yield "phone:"+d["phone"]
    yield "city:"+d["city"].strip()
    yield "region:"+(d["state"].strip() or d["country"].strip())
    yield "zip:"+d["zip"]
    # sometimes the city gets replicated in the address, making the
    # data "dirtier" and thus less valuable
    vwords = d["venue"].lower().split()
    for word in d["city"].lower().split():
        if word in vwords:
            yield "city:<in venue>"
            break
    # the record's id reflects its age - older records, and thus
    # smaller ids, are more likely to be outdated; bucket ids by
    # their base-2 magnitude
    try:
        yield "id:2**%.0f" % (math.log(int(d["id"]) // 100) / math.log(2))
    except (OverflowError, ValueError):
        # ids below 100 - the oldest records of all
        yield "id:2**0"

...

classifier = Classifier()

The input to the tokenizer, instead of being an email message, is a
dictionary representing one row returned from an SQL query. When an item is
to be deleted, the classifier is trained on it like so:

classifier.train(d, False)

When moving to the next chunk, the remaining records are trained like
so:

for item in chunk:
    classifier.train(item, True)
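
For reference, a record dictionary carries the fields the tokenizer
touches. A made-up example might look like:

d = {"id": "31415", "venue": "The Example Room",
     "addr1": "123 Main St", "addr2": "",
     "city": "Anytown", "state": "CA", "country": "",
     "zip": "90210", "phone": "555-1212"}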

I haven't gotten too crazy with the tokenizer (compare it with the Spambayes
tokenizer!). I will probably collect some other clues in the tokenizer,
such as what other tables reference the venue record (see the sketch after
this paragraph). For the time being,
it's working okay. I just need it to do a reasonably good job segregating
records so I can quickly scan a group and make a deletion decision. So far,
it's doing a very good job. Not bad for 15-30 minutes of work...
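
As a sketch of one such clue, a reference count could be bucketed into
coarse tokens like this - the ref_count field is hypothetical and would
have to come from a COUNT() in the query or a separate lookup:

def reference_tokens(d):
    # hypothetical extra clue: how many concert rows point at this venue
    refs = int(d.get("ref_count", 0))
    if refs == 0:
        yield "refs:<none>"
    elif refs < 10:
        yield "refs:<few>"
    else:
        yield "refs:<many>"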

Skip

