
Markov-based SPAM filter: questions about statistical analysis

Timothy Miller

Sep 2, 2002, 10:30:53 PM
For some time now, I have been working on a Markov model based spam
filter. The algorithm I have for judging emails against the training
set of spam and non-spam seems to work quite well. Unfortunately
having no background in statistics, I don't really know WHY it works
this well, and having an answer to that would help me to improve not
only reliability but the usefulness of the information reported. I do
want to understand this fully, so don't hesitate to get into a
discussion with significant depth.

For those interested in seeing the algorithm, I have made it available
here:

http://home.cfl.rr.com/theovon/xpmntr.tar

In case you're wondering, my approach was not actually inspired by the
recent development by LISP guru Paul Graham. I did enjoy his article,
and it inspired me to write this post asking for ideas that might help
me to refine what my spam filter reports.

Background

First, a description of how my spam filter works right now.
Rather than making a Markov model based on words, I decided to work
with raw characters. There are advantages and disadvantages to this,
but I felt that the ability to model based on arbitrary symbols,
MIME-encoded content, etc. would be valuable. Plus, it's a lot
easier.

Consider an email to be a linear string of bytes. I have a sliding
window which is 12 (chosen somewhat arbitrarily) characters long and
slides along the email from beginning to end, and at each starting
position, I store those 12 characters in a database. I store them in the
form of a tree, but a simple list would work just as well (performance
and memory usage are the only tradeoffs). For each sequence of 12
characters, I store the number of occurrences that have ever been
observed as two counts, one being the number of occurrences in spam,
and the other for non-spam.
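
(To make the bookkeeping concrete, here is a rough Python sketch of
the counting step. The names are mine, and a plain dict stands in for
the tree the real code uses; the tarball above is the actual
implementation.)

    WINDOW = 12  # sequence length; chosen somewhat arbitrarily

    def train(db, text, is_spam):
        # Slide a WINDOW-byte window along the message and bump the
        # [spam, non-spam] count for the sequence at each position.
        for i in range(len(text) - WINDOW + 1):
            seq = text[i:i + WINDOW]
            counts = db.setdefault(seq, [0, 0])
            counts[0 if is_spam else 1] += 1

    db = {}
    train(db, b"Buy now!  This is a limited time offer, buy now!", True)
    train(db, b"The meeting has moved to Thursday at 3pm, see you.", False)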

Given this database, I can generate Markov chains. If my sequences
are 12 characters long, I can begin with an arbitrary sequence of 11
characters and extract the subset of the database whose first 11
characters match my initial sequence. In this subset, every sequence
has spam and non-spam counts, but let's pick one for the sake of
discussion and say we want to generate spam. Each count represents
the number of times each 12th character has appeared after the first
11. I can generate a random number and choose one of those 12th
characters so that ones which have been observed more often are more
likely to occur. Throwing away the first character and taking the
12th character as a new 11th character, I can repeat this process and
generate text which is clearly recognizable as spam. For those who
are interested in seeing how far this idea can be taken, I tried it on
audio and produced meaningful results.
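
(Roughly, the generation loop looks like this in Python. Again just a
sketch: the linear scan over the dict for matching 11-character
prefixes is exactly the lookup the tree makes cheap.)

    import random

    WINDOW = 12

    def generate(db, seed, length):
        # seed must be at least WINDOW-1 bytes long.
        out = bytearray(seed)
        for _ in range(length):
            context = bytes(out[-(WINDOW - 1):])
            # Every stored sequence whose first 11 bytes match the
            # context; weight each candidate 12th byte by its spam count.
            candidates = [(s[-1:], c[0]) for s, c in db.items()
                          if s[:-1] == context and c[0] > 0]
            if not candidates:
                break  # context never observed in spam
            chars, weights = zip(*candidates)
            out += random.choices(chars, weights=weights)[0]
        return bytes(out)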

Judging spam

Of course, the point behind this is to filter spam, so I have tried
two approaches so far to judge new emails against the training data.

The first, most obvious approach is to look up each 12-character
sequence in an email to be judged and total the number of times each
sequence is found in spam and non-spam. Divide each by the total
spam and non-spam counts for the whole database, and divide by the
number of 12-character sequences in the email, and you get two numbers
which are the probabilities that the email is spam and non-spam. If
the spam probability is greater than the non-spam probability, then
you declare it spam. Unfortunately, this approach yielded very
unsatisfactory results, and I don't know why. I would judge some of
the training emails, and they would often be judged incorrectly. So
my first question is: Why didn't this technique work?
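
(A sketch of this first technique, same assumptions as above:)

    WINDOW = 12

    def judge_v1(db, text):
        # Total the raw spam/non-spam counts over every window, then
        # normalize by the database-wide totals and the window count.
        total_spam = max(sum(c[0] for c in db.values()), 1)
        total_ham = max(sum(c[1] for c in db.values()), 1)
        n = max(len(text) - WINDOW + 1, 1)
        spam = ham = 0.0
        for i in range(len(text) - WINDOW + 1):
            c = db.get(text[i:i + WINDOW], (0, 0))
            spam += c[0]
            ham += c[1]
        return spam / total_spam / n, ham / total_ham / n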

The next approach I tried is similar to the text generation technique
described above. For each 11-character sequence in the suspect email,
I would consider only the subset of the database which matches the
first 11 characters of the input sequence. Then I would judge whether
or not a given sequence indicated spam by dividing the counts of the
matching 12-character sequence by the total counts from the subset
matching the first 11. Of course, this is a great deal more like how
I would imagine Markov chains should work, but I'll need someone to
confirm for me if that is the case. Is it? But the important
question here is: Why did this approach work so much better than the
first?
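
(And a sketch of the second technique, estimating per class the
chance of the 12th character given the first 11. How best to
aggregate the per-window ratios is the open question below; this
simply sums them, as the next paragraph describes.)

    WINDOW = 12

    def judge_v2(db, text):
        # For each window, estimate P(12th char | first 11) per class:
        # the sequence's own count divided by the total count of all
        # sequences sharing its 11-character prefix.
        spam = ham = 0.0
        for i in range(len(text) - WINDOW + 1):
            seq = text[i:i + WINDOW]
            prefix = seq[:-1]
            ctx = [c for s, c in db.items() if s[:-1] == prefix]
            ctx_spam = sum(c[0] for c in ctx)
            ctx_ham = sum(c[1] for c in ctx)
            c = db.get(seq, (0, 0))
            if ctx_spam:
                spam += c[0] / ctx_spam
            if ctx_ham:
                ham += c[1] / ctx_ham
        return spam, ham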

Given the first technique, I can judge relative probabilities. I can
compute the chances a given email is spam and the chances a given
email is not. But for the second, how do I do that? I can guess that
one would do something similar to the first case. Given the count
from a sequence of 12, divide by the total counts from the matching
sequences of 11. Sum those over all of the sequences in a suspect
email and then divide by the total number of sequences in the input.
But that would not produce a number much like the one from the first
technique. I can understand the MEANING of the result of the first,
even though it's often an incorrect result, but what is the meaning of
a result from the second?

Generating a meaningful result

Finally, I have two probabilities. One is the probability that
something is spam, given what spam looks like. The other is the
probability that something is not spam, given what non-spam looks
like. How do I turn those two numbers into one which indicates how
likely something is to be spam?
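
(For illustration only: one textbook combination is Bayes' rule with
an assumed prior, sketched below. Whether that is the right thing to
do here is exactly what I'm asking.)

    def combine(p_given_spam, p_given_ham, prior_spam=0.5):
        # Bayes' rule: P(spam | message) from the two class-conditional
        # probabilities; prior_spam = 0.5 is an assumed even prior.
        num = p_given_spam * prior_spam
        den = num + p_given_ham * (1.0 - prior_spam)
        return num / den if den else 0.5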

In summary, I would like to understand certain things about this:

Am I correctly understanding the philosophy of Markov chains?
Why didn't the first technique work?
Why did the second?
What is the meaning of the second?
How do you sum probabilities?
12-character sequences seem to work well. Would shorter be better?
How do I turn a spam probability and a non-spam probability into one
spam probability?

Any suggestions will be very much appreciated. Thanks.

Programmer Dude

Sep 3, 2002, 2:50:58 PM
Timothy Miller wrote:

> For those interested in seeing the algorithm, I have made it available
> here:

Fascinatin' approach!

--
|_ CJSonnack <Ch...@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL |
|_____________________________________________|_______________________|

Opinions expressed herein are my own and may not represent those of my employer.

Gottfried Helms

Sep 3, 2002, 4:41:21 PM
Hi,

I am surprised that your Markov filter with a length of 12 works at all.
Did you check what the frequencies of the 12-character elements look like?
I guess the overwhelming number of them occurs only once. So maybe the
improvement you observe from normalizing these frequencies by the
length-11 frequencies comes simply from the fact that this produces
usable relative frequencies.
When I fiddled around with Markov chains like this, I had the best results
with lengths of 3 to 5 characters. But that was a long time ago, and I
don't have the code handy. Just a suggestion. Perhaps your 12/11-ratio
filter is especially good anyway.
On the other hand: to have an effective spam filter, I'd be willing to use
even unrealistic lengths... ;-)
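
(A quick way to check that guess, as a rough Python sketch against the
database shape described in the original post:)

    def singleton_fraction(db):
        # Fraction of stored 12-character sequences seen exactly once;
        # a value near 1.0 would support the sparsity concern above.
        totals = [c[0] + c[1] for c in db.values()]
        if not totals:
            return 0.0
        return sum(1 for t in totals if t == 1) / len(totals)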

Gottfried Helms

john bailey

Sep 3, 2002, 4:41:20 PM
On 2 Sep 2002 19:30:53 -0700, the...@hotmail.com (Timothy Miller)
wrote:

>For some time now, I have been working on a Markov model based spam
>filter. The algorithm I have for judging emails against the training
>set of spam and non-spam seems to work quite well. Unfortunately
>having no background in statistics, I don't really know WHY it works
>this well, and having an answer to that would help me to improve not
>only reliability but the usefulness of the information reported. I do
>want to understand this fully, so don't hesitate to get into a
>discussion with significant depth.

>Consider an email to be a linear string of bytes. I have a sliding
>window which is 12 (chosen somewhat arbitrarily) characters long and
>slides along the email from beginning to end, and at starting
>position, I store those 12 characters in a database. I store in the
>form of a tree, but a simple list would work just as well (performance
>and memory usage are the only tradeoffs). For each sequence of 12
>characters, I store the number of occurrences that have ever been
>observed as two counts, one being the number of occurrences in spam,
>and the other for non-spam.

>Generating a meaningful result
>
>Finally, I have two probabilities. One is the probability that
>something is spam, given what spam looks like. The other is the
>probability that something is not spam, given what non-spam looks
>like. How do I turn those two numbers into one which indicates how
>likely something is to be spam?

>Any suggestions will be very much appreciated. Thanks.

If you consider your Markov based spam generator a soothsayer, making
probabilistic predictions, a sound theoretical base is easily grasped.
There was a discussion of the topic "Scoring a soothsayer" on 1998/12/02,
for which Miguel A. Lerma posted the correct solution:
Message-ID: <742527$m...@news.acns.nwu.edu>#1/1

For your spam filter, you just consider the Markov model's predictions
as if you were scoring a soothsayer, accumulate the scores, and when
the score difference between predicting spam and non-spam reaches a
threshold, declare the verdict.
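
(Roughly, in Python, against the database shape from the original
post; the eps floor for never-seen transitions is my own addition:)

    import math

    WINDOW = 12

    def log_score(db, text):
        # Accumulate f(p) = log p per class, following Miguel Lerma's
        # scheme quoted below; eps is an assumed floor so that unseen
        # transitions do not give log(0).
        eps = 1e-9
        spam_total = ham_total = 0.0
        for i in range(len(text) - WINDOW + 1):
            seq = text[i:i + WINDOW]
            prefix = seq[:-1]
            ctx = [c for s, c in db.items() if s[:-1] == prefix]
            ctx_spam = sum(c[0] for c in ctx)
            ctx_ham = sum(c[1] for c in ctx)
            c = db.get(seq, (0, 0))
            p_spam = c[0] / ctx_spam if ctx_spam else eps
            p_ham = c[1] / ctx_ham if ctx_ham else eps
            spam_total += math.log(max(p_spam, eps))
            ham_total += math.log(max(p_ham, eps))
        # Judge spam when spam_total - ham_total exceeds a threshold.
        return spam_total, ham_total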

For your convenience, I have appended Miguel's post here:

From: le...@math.nwu.edu (Miguel A. Lerma)
Subject: Re: Scoring a soothsayer
Date: 1998/12/02
Message-ID: <742527$m...@news.acns.nwu.edu>#1/1
References: <36632f88...@news.frontiernet.net>
Followup-To: rec.puzzles,sci.math
Organization: Northwestern University, Evanston, IL, US
Newsgroups: rec.puzzles,sci.math


John Bailey (jmb...@frontiernet.net) wrote:
: A soothsayer has a generally good, but not perfect knowledge of the
: probability of events in a series which will occur. He/she gives
: his/her estimate of the probability of the event and all learn the
: outcome of the event before his/her next prediction. By most means of
: scoring, the soothsayer, having a good estimate of any event's
: probability can use that knowledge to improve his/her score. Devise a
: means of scoring the soothsayer's performance such that the soothsayer
: has no incentive to moderate or exaggerate his/her predictions.
: Think of a weatherman (a soothsayer if there ever was one) who gives
: the probability of rain tomorrow. She quotes a value p(i) for the
: probability of rain on the i'th day. If it rains, she gets f(p)
: points, if it doesn't rain she gets g(p) points. Her score will be
: sum over i from 0 to today of f(p(i))*v(i) + g(p(i))*(1-v(i)), where v(i)
: are the outcomes, rain, no rain, no rain, rain, etc expressed as a
: binary vector. In most scoring schemes, on the i'th day, she can look
: at her sum of scores for being right and those for being wrong, and
: can adjust her announced estimate of p for that day to increase her
: likelihood of getting a higher score. There is at least one family of
: scoring values (functions of p) for which this is not the case.

The optimal scoring scheme consists of taking (except for a positive
multiplicative constant) f(p) = log(p), g(p) = f(1-p). Then the
expected score on the i-th day will be:

S(i) = P(i) log(p(i)) + (1-P(i)) log(1-p(i))

where P(i) is the "actual" probability of rain on that day.
For a given probability P(i), S(i) is maximum when p(i) = P(i)
(differentiating gives dS/dp = P(i)/p(i) - (1-P(i))/(1-p(i)), which
vanishes exactly at p(i) = P(i)), i.e., when the weatherman gives the
"actual" probability, so she has no advantage in manipulating it.

It is interesting to note that this scoring scheme not only
discourages the weatherman from manipulating probabilities,
it also rewards the accuracy of her predictions. If by any
chance the weatherman gets better knowledge about the weather,
she will be able to give probabilities closer to 1 and 0
- also the "actual" probabilities, which depend on the amount of
information available about the system, will be closer to 1 or 0.
Since the function x log(x) + (1-x) log(1-x) gets larger
when x gets closer to 0 or 1, the scoring will also be larger.
Actually that function can be used to measure the amount of
information available about the system (that function with
the opposite sign is the "Shannon entropy" of the system).


Miguel A. Lerma

I forgot to add that there is a natural generalization to events
with n possible outcomes (the rain/no rain case is an example with
n=2). The scoring scheme would be as follows: if the soothsayer
assigns probabilities p_1, p_2,..., p_n respectively to possible
outcomes O_1, O_2,..., O_n, and actually O_k happens, then we
give her f(p_k) points, where f is the function we are looking for.

Also in this case f(p) = log(p) provides an expected score that
is maximized for p_k = actual probability of k, rewards accuracy
in predictions, and has nice connections with information theory.


Miguel A. Lerma

John Bailey
http://home.rochester.rr.com/jbxroads for my real email address

Oliver Bandel

Sep 3, 2002, 7:01:09 PM

IMHO the window length that makes a good (or bad)
filter can vary considerably, depending on the
language as well as the theme that the mails are
written about.

If the filter works well for most mails, it may
work badly on others, e.g. on a special theme
(word lengths and character-to-character probabilities
differ then) or on other base languages.

So, if you want to filter with such algorithms, you should
test it with very large and diverse datasets.
Try English, German, Arabic, some African dialects.

Try with "normal" (typical English small-talk) mails,
with medical texts, with themes of architecture,
history, politics, psychology, programming, prose,
philosophy, business, ...

And try to use both categories, base language and
theme, in all possible combinations.

Then you may end up with a solution that has to
use different filters to distinguish different
kinds of mail, and spam may be only one special
kind of mail: normally written in the English
language (IMHO) and IMHO with similarities
to business/marketing text.


Ciao,
Oliver

john bailey

Sep 3, 2002, 9:38:58 PM
On 2 Sep 2002 19:30:53 -0700, the...@hotmail.com (Timothy Miller)
wrote:

>For some time now, I have been working on a Markov model based spam
>filter. The algorithm I have for judging emails against the training
>set of spam and non-spam seems to work quite well. Unfortunately
>having no background in statistics, I don't really know WHY it works
>this well, and having an answer to that would help me to improve not
>only reliability but the usefulness of the information reported. I do
>want to understand this fully, so don't hesitate to get into a
>discussion with significant depth.

Mitsubishi may have anticipated the idea in their patent granted
August 29, 2000:

US Patent #6,112,021 Markov model discriminator using negative
examples, which can be found using the US Patent office search page:
http://patft.uspto.gov/netahtml/search-bool.html
(quoting)
The subject system is used for identifying particular traits or
characteristics of sequences to permit identification of, for
instance, inappropriate web page material, hand signing gestures,
audio program material type, authorship of a text, with the system
also being useful in speech recognition, as well as seismic, medical,
and industrial monitoring. (end quote)

I think it's a great idea!
