New kind of Spam Filter

Roedy Green

Sep 21, 2003, 5:31:31 PM

I wonder if anyone would be interested in a beast such as this.
It is a sort of roll-your-own spam filter. I have been trying out
various filters and none work. They all have some fatal flaw that I
can't fix.

What I am proposing is a simple Javamail framework that looks at
messages on the server, and runs a number of user written filters on
them.

A user written filter gets passed a MimeMessage object, and returns a
float representing the probability this is spam, or the probability
this is definitely good. The user implements either an IsSpam or
IsHam interface.

It might come with a number of canned filters, e.g. everyone in my Eudora
address book is considered ham, all mail not addressed to me is spam,
all mail addressed to more than N people is spam, something that
recognizes the variations on the current worm of the day's email, no
Chinese or Korean messages, and so on.
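
A minimal sketch of three of the canned filters listed above. To stay free of JavaMail it scores a hypothetical `MailSummary` holder instead of a MimeMessage, and it assumes a score of -1 means "no opinion". The class names, method names, and the lower-cased address book are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the few header fields these filters need. */
final class MailSummary {
    final String from;
    final List<String> recipients;

    MailSummary(String from, List<String> recipients) {
        this.from = from;
        this.recipients = recipients;
    }
}

final class CannedFilters {
    /** Score meaning "this filter has no opinion" (an assumption of this sketch). */
    static final float NO_OPINION = -1f;

    /** Senders in the (lower-cased) address book are ham; everyone else gets no opinion. */
    static float addressBook(MailSummary m, Set<String> addressBook) {
        return addressBook.contains(m.from.toLowerCase(Locale.ROOT)) ? 0.0f : NO_OPINION;
    }

    /** Mail that does not list my address among its recipients is treated as spam. */
    static float notAddressedToMe(MailSummary m, String myAddress) {
        for (String r : m.recipients) {
            if (r.equalsIgnoreCase(myAddress)) {
                return NO_OPINION;
            }
        }
        return 1.0f;
    }

    /** Mail addressed to more than maxRecipients people is treated as spam. */
    static float tooManyRecipients(MailSummary m, int maxRecipients) {
        return m.recipients.size() > maxRecipients ? 1.0f : NO_OPINION;
    }
}
```

Each method is deliberately one-sided: it either votes or abstains, so a whitelist hit never counts as evidence of spam and a long recipient list never counts as evidence of ham.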

The advantage is, you can add any feature you like without having to
write an entire program. You can write a filter just to get rid of a
particular class of annoying spam, like Nigerian scam letters.

You might write custom filters for your customers so they don't have
to do fancy configuring. You just start the thing up then ignore it.

It would have no GUI, just a configuration file written in Java that
you compile to create the app you need.

Alternatively it might use Class.forName and not require compilation
of the config file.
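
A rough sketch of the Class.forName route: the config file would only need to list class names, and the framework would instantiate them reflectively. `NamedFilter`, `ShoutingSubjectFilter` and `FilterLoader` are hypothetical names, and the filter scores a bare subject line only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter contract; scores a raw subject line for brevity. */
interface NamedFilter {
    float probabilityIsSpam(String subject);
}

/** Example filter that the loader can instantiate by name. */
class ShoutingSubjectFilter implements NamedFilter {
    @Override
    public float probabilityIsSpam(String subject) {
        // A subject containing no lower-case letters is treated as suspicious.
        return subject.equals(subject.toUpperCase()) ? 0.9f : 0.5f;
    }
}

final class FilterLoader {
    /** Instantiates each named class through its no-argument constructor. */
    static List<NamedFilter> load(List<String> classNames) {
        List<NamedFilter> filters = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<?> c = Class.forName(name);
                filters.add((NamedFilter) c.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalStateException("Cannot load filter " + name, e);
            }
        }
        return filters;
    }
}
```

The trade-off is that a misspelled class name only shows up at run time, which a compiled Java config would have caught earlier.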

It either deletes the message, or perhaps adds a "probable spam"
indicator to the subject line for filtering in the email program
or for a manual lookover.

Ideally people might contribute their user-written filters for others
to use and/or modify.

To reduce RAM overhead, since it runs all the time, you might compile
it with JET.

I already have much of this code working as part of my bulk remailer.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

Wojtek

Sep 21, 2003, 7:27:01 PM

On Sun, 21 Sep 2003 21:31:31 GMT, Roedy Green <ro...@mindprod.com>
wrote:

>What I am proposing is a simple Javamail framework that looks at
>messages on the server, and runs a number of user written filters on
>them.

I own my own domain, but do not manage the servers. So I get all the
SPAM directed at any address in the domain. My provider currently does
not filter anything (which is OK, as I prefer to do it myself).

Anyway, for this type of app to work for me, it would run on my
workstation and it would need FTP capability. That is:
- open an FTP connection to the mail directory
- scan through the files
-- apply the filter criteria
--- delete SPAM
- cleanup

>It might come with a number canned filters, e.g. everyone in my Eudora
>address book is considered ham, all mail not addressed to me is spam,
>all mail addressed to more than N people is spam. Something that
>recognizes the variation on the current's worm of the day's email, no
>Chinese or Korean messages,

I get a LOT of emails with identical subject lines (or very close to
identical), so some method of saving the subject patterns between
"runs" would be good.

------------------------
Wojtek Bok
Solution Developer

Roedy Green

Sep 21, 2003, 9:59:34 PM

On Sun, 21 Sep 2003 23:27:01 GMT, Wojtek <su-...@bossi.com> wrote or
quoted :

>Anyway, for this type of app to work for me, it would run on my
>workstation and it would need FTP capability. That is:
>- open an FTP connection to the mail directory
>- scan through the files
>-- apply the filter criteria
>--- delete SPAM
>- cleanup

Why would FTP be preferable to POP3 to do this?

Wojtek

Sep 22, 2003, 12:43:40 AM

On Mon, 22 Sep 2003 01:59:34 GMT, Roedy Green <ro...@mindprod.com>
wrote:

>On Sun, 21 Sep 2003 23:27:01 GMT, Wojtek <su-...@bossi.com> wrote or
>quoted :
>
>>Anyway, for this type of app to work for me, it would run on my
>>workstation and it would need FTP capability. That is:
>>- open an FTP connection to the mail directory
>>- scan through the files
>>-- apply the filter criteria
>>--- delete SPAM

The missing assumed step was that legitimate email would be left on
the server for the email client.

>>- cleanup
>
>why would ftp be preferable to POP3 to do this?

Well, I don't use Eudora....

I am not sure how this would work with an email client. I assumed that
it was for host side: filter the email before the client gets it via
POP3.

Or does it interpose itself between the email client and the IP stack?
If so, it would be best placed on my firewall, where it would act for
all my internal machines.

Or do I use it to get my email (via POP3) from the ISP to my
firewall, then I point my email client to the firewall which then acts
as the POP3 email server? If so, will it provide POP3 services, or do
I need to get one?

Besides my own domain, I am also the webmaster for another. That one
has several people who get email on it. How would this work for that
domain (I cannot run this on the actual mail server as it is owned by
the ISP)? All those users use Outlook.

I like the idea, especially the configurable filters. The email client
I use (PMMail 2000) already has built in filters as well as a somewhat
arcane filter language. But it does not have "memory" nor the
capability to compare all the emails as a group for patterns.

Roedy Green

Sep 22, 2003, 10:23:38 AM

On Mon, 22 Sep 2003 04:43:40 GMT, Wojtek <su-...@bossi.com> wrote or
quoted :

>Or does it interpose itself between the email client and the IP stack?
>If so, it would be best placed on my firewall, where it would act for
>all my internal machines.

There are two approaches I have seen used.

The simpler approach is to have the spam filter act like an email
client. It goes, sniffs the mail, and deletes spam and leaves the
good stuff on the server. Then you run your real email program. The
problem is some new spam could have come in between the time you ran
the filter and you picked up your mail. It usually only downloads the
first paragraph or so of the message to decide if it is spam.

The other technique is to implement the filter as an email proxy
server. You then have your email program talk to localhost:9999
instead of the regular mail server. The K9 people did this, but only
on the POP3 side. Eudora thus does not work since you can't configure
the SMTP side independently.

To deal with spam you can either delete it on the server, or mark it
specially e.g. put [Spam] in subject line, where it is easy for the
mail program to filter it.

For this virus-generated stuff where the messages themselves are
fairly fat with an attachment, it makes sense to delete on the server
without downloading the whole thing.

Roedy Green

Sep 22, 2003, 10:29:52 AM

On Mon, 22 Sep 2003 04:43:40 GMT, Wojtek <su-...@bossi.com> wrote or
quoted :

>Besides my own domain, I am also the webmaster for another. That one
>has several people who get email on it. How would this work for that
>domain (I cannot run this on the actual mail server as it is owned by
>the ISP)? All those users use Outlook.

I was thinking of something that ran client side, since it would talk
POP3. However, you could run it on the server, just talking to the
mailserver locally.

Probably more efficient though would be to write a different framework
that ran the same filters (not as tight as for individuals) to filter
all mail coming into an ISP.

The main idea of the program is that it is user-extensible with
arbitrarily complicated Java code. It would be fairly trivial to add
a blacklist ISP filter, a friends/enemies filter. All the work does
not fall on one person. You don't have the political problem of
convincing the author your style of filter is important or explaining
just how your filter should work.

Java code is something we all understand. Learning how to write
filters by GUI often gets you 90% of the way to where you want to be and
leaves you dangling.

Roedy Green

Sep 22, 2003, 10:31:27 AM

On Mon, 22 Sep 2003 04:43:40 GMT, Wojtek <su-...@bossi.com> wrote or
quoted :

>I like the idea, especially the configurable filters. The email client
>I use (PMMail 2000) already has built in filters as well as a somewhat
>arcane filter language.

If you are trying to support clients, they all have different email
programs. Further you can't usually figure out the filter then send
it to them. You have to coach them through setting it up themselves.
Phhht!

Roedy Green

Sep 22, 2003, 10:34:09 AM

On Mon, 22 Sep 2003 04:43:40 GMT, Wojtek <su-...@bossi.com> wrote or
quoted :

>. But it does not have "memory" nor the
>capability to compare all the emails as a group for patterns.

This is trickier.

The framework could ask the filter if it wanted to be called twice.
It would get a chance to look at all the incoming new mail first, then
on the second pass make its final decision.

Or perhaps it would only get one look at each new piece, but it would
be at liberty to maintain its own persistent internal state.
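
One way the "called twice" idea might look, as a sketch only. `TwoPassFilter`, `RepeatedSubjectFilter` and `TwoPassRunner` are hypothetical names, subjects stand in for whole messages, and -1 is assumed to mean "no opinion".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-pass contract: observe() sees every new message before score() runs. */
interface TwoPassFilter {
    void observe(String subject);

    float score(String subject);
}

/** Flags subjects that occur more than once in the current batch. */
final class RepeatedSubjectFilter implements TwoPassFilter {
    private final Map<String, Integer> seen = new HashMap<>();

    @Override
    public void observe(String subject) {
        seen.merge(subject, 1, Integer::sum);
    }

    @Override
    public float score(String subject) {
        // -1 = no opinion (an assumption of this sketch).
        return seen.getOrDefault(subject, 0) > 1 ? 0.9f : -1f;
    }
}

final class TwoPassRunner {
    /** First pass feeds the whole batch to the filter; second pass collects the scores. */
    static float[] run(TwoPassFilter filter, List<String> subjects) {
        for (String s : subjects) {
            filter.observe(s);
        }
        float[] scores = new float[subjects.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = filter.score(subjects.get(i));
        }
        return scores;
    }
}
```

The single-pass alternative would keep the same map but load and save it between runs instead of rebuilding it per batch.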

Gary M

Sep 22, 2003, 12:29:19 PM

Roedy Green <ro...@mindprod.com> wrote in
news:1a5smvgv6d0fp4sp4...@4ax.com:

> I wonder if anyone would be interested in a beast such as this.
> It is a sort of roll-your-own spam filter. I have been trying out
> various filters and none work. They all have some fatal flaw that I
> can't fix.
>

Hi Roedy, this is a subject of great interest for me.

As a long aside, I have written my own spam terminator program which I
called (misnomered now) PopSpam. It uses JavaMail with Jetty embedded as
a default servlet container for the GUI. It works along the lines of a
mail client that polls the POP or IMAP server periodically and executes
white and black rules.

I built it over MySQL, but data access interfaces allow other
persistence layers to be built. Likewise rules are pluggable, and I employ
a simple Java interface that lets you write your own rules and, importantly,
prioritize execution, as rules can be expensive endeavours (for example, you
want to perform your body checks as a last resort, as this requires a full
message download). It also has some self-optimization built in that
arranges for the most successful rules to be checked first.
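
A sketch of how such an ordering might be computed. This is not PopSpam's code: the `Rule` fields and the "cost first, then hit count" policy are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical rule record: a name, a relative cost, and a running hit count. */
final class Rule {
    final String name;
    final int cost;   // 1 = headers only; larger values need more of the message
    final int hits;   // how many messages this rule has caught so far

    Rule(String name, int cost, int hits) {
        this.name = name;
        this.cost = cost;
        this.hits = hits;
    }
}

final class RuleOrder {
    /** Cheapest rules first; among equally cheap rules, the most successful first. */
    static List<Rule> prioritize(List<Rule> rules) {
        List<Rule> ordered = new ArrayList<>(rules);
        ordered.sort(Comparator.comparingInt((Rule r) -> r.cost)
                .thenComparing(Comparator.comparingInt((Rule r) -> r.hits).reversed()));
        return ordered;
    }
}
```

With this policy a body check never runs before a header check, however successful it has been, which keeps full downloads as the last resort.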

I run the app over a number of friends' and family's accounts and I have a
better than 99% success rate. I have adopted a fundamental position that
statistical score-based filters are unworkable for the _common_ user
(important distinction), because these statistical methods require
users/organizations to maintain corpora of 'bad' and 'good' spam (what
does the pharmacy think when its legit viagra mail gets wiped?). So my
approach is based entirely on deterministic rules with spam features that
are personalized to the user/corporation; that said, it would be
relatively simple to employ some rules that query a model. I just don't
see much mileage in it for most users.

My basic premise is that everything that meets a black rule is deleted
(stored, actually, in case of a false positive) and everything that meets a
white rule is allowed through. This means it would be possible for me to
receive email from you even though you are not known to me. However, if
your message was spammy you'd probably not get through. There is the
Achilles heel, and every antispam solution has one. This is mitigated as I
store the message for recovery.

My ultimate goal is to build an antispam program for admins, not for a
mail user, that acts as an SMTP proxy prechecking messages. It would
require the mail user to review caught spam periodically and to create
and customize their white rules.

The reason for this approach is that I feel spammers and most antispam
tools exploit both sides of the same problem: bandwidth is cheap, so I
can spam as much as I like, says the spammer. The antispam tool says
bandwidth is cheap so I can download the whole message and check it at
the client. The antispam tool I envision is attractive to organizations
as it would address this fallacy.

Anyway getting back to your post. I am interested in this project if it
gets going. I'd be willing to help out in any way I can. My own effort
has reached a stable point, but I have lost some enthusiasm to finish it
given the almost daily announcements of similar tools that are making for
a rather crowded marketspace.

Gary

Gary M

Sep 22, 2003, 12:31:25 PM

Gary M <gax...@yahoo.com(xx=ry)> wrote in
news:Xns93FE7F06A30DEg...@216.168.3.44:

> I run the app over a number friends and family's accounts and I have a
> better than 99% success rate. I have adopted a fundamental position that
> statistical score based filters are unworkable for the _common_ user
> (important distinction), because these statistical methods require
> users/organizations to maintain corpora of 'bad'and 'good' spam

This should read "'bad' and 'good' messages".

David Segall

Sep 22, 2003, 12:58:06 PM

Roedy Green <ro...@mindprod.com> wrote:

>I wonder if anyone would be interested in a beast such as this.
>It is a sort of roll-your-own spam filter. I have been trying out
>various filters and none work. They all have some fatal flaw that I
>can't fix.

I agree but I think it may be more productive to implement an email
client to solve the problem. The client already knows the format of
your address book and has "read" and, if necessary stored, your
previous emails.

>What I am proposing is a simple Javamail framework that looks at
>messages on the server, and runs a number of user written filters on
>them.
>
>A user written filter gets passed a MimeMessage object, and returns a
>float representing the probability this is spam, or the probability
>this is definitely good. The user implements either an IsSpam or
>IsHam interface.

It may be preferable to provide an optional separate set of filters
for the message headers to avoid a double download of legitimate large
emails. Actually this would be useful to avoid the download of the
attachments in the current wave of spam which, in my case, is around
200MB per day.

>It might come with a number canned filters, e.g. everyone in my Eudora
>address book is considered ham, all mail not addressed to me is spam,
>all mail addressed to more than N people is spam. Something that
>recognizes the variation on the current's worm of the day's email, no
>Chinese or Korean messages,
>
>The advantage is, you can add any feature you like without having to
>write an entire program. You can write a filter just to get rid of a
>particular class of annoying spam, like Nigerian scam letters.
>
>You might write custom filters for your customers so they don't have
>to do fancy configuring. You just start the thing up then ignore it.
>
>It would have no GUI, just a configuration file written in Java that
>you compile to create the app you need.
>
>Alternatively it might use class for Name and not require compilation
>of the config file.
>
>It either deletes the message, or perhaps adds a "probable spam"
>indicator to the subject line for filtering in the email program
>manual lookover.
>
>Ideally people might contribute their user-written filters for others
>to use and or modify.

I like the idea of a library of user-written filters particularly
because some of them would have to interpret one or more of the many
address list formats used by email clients. I would really like a
filter which tells me that the return address is invalid but it would
require a much better knowledge of the protocols than I possess.

>
>To reduce ram overhead, since it runs all the time, you might compile
>it with JET.
>
>I already have much of this code working as part of my bulk remailer.

That's a good argument for ignoring my idea of writing an email
client. :) Publish a "pre-Alpha" version and see what happens.

Neil Campbell

Sep 22, 2003, 2:28:42 PM

Roedy Green wrote:

> I wonder if anyone would be interested in a beast such as this.
> It is a sort of roll-your-own spam filter. I have been trying out
> various filters and none work. They all have some fatal flaw that I
> can't fix.
>
>
> What I am proposing is a simple Javamail framework that looks at
> messages on the server, and runs a number of user written filters on
> them.

I think the problem with this approach is that simple user-written filters
aren't usually terribly successful. To effectively keep out the majority
of spam you need much more sophisticated techniques.

If your framework provides a way for people to write things like Bayesian
filters more easily, then it would be very valuable; however I think it
will only be as good as the filters available for it. Most of the very
simple filters can usually be implemented by the mail client itself, of
course (at least in KMail and Outlook, presumably in others as well).

In my opinion, you'd have to think about what your system would provide that
similar tools don't.

--
Neil Campbell
batneil[AT]lineone[DOT]net
http://www.thebatcave.org.uk

Roedy Green

Sep 22, 2003, 3:56:27 PM

On Mon, 22 Sep 2003 16:58:06 GMT, David Segall <da...@segall.net>
wrote or quoted :

>That's a good argument for ignoring my idea of writing an email
>client. :) Publish a "pre-Alpha" version and see what happens.

Here is a first cut at the interface for spam filters:

package com.mindprod.spam;

import javax.mail.internet.MimeMessage;

/**
 * Interface for a spam filter.
 *
 * @author Roedy Green
 * @version 1.0
 * @since 2003-09-22
 */
public interface SpamDetect
    {
    /**
     * What is the probability the given message is spam?
     * 0.0 = definitely good.
     * 0.5 = 50-50 odds.
     * 1.0 = absolutely certainly spam.
     * -1 = no opinion.
     *
     * @param message MimeMessage from which you can extract any fields
     *                of interest.
     *
     * @return probability
     */
    public float probabilityIsSpam ( MimeMessage message );

    /**
     * Fire up this filter.
     * Do any one-time initialisation,
     * e.g. load tables, restore persistent state.
     */
    public void open();

    /**
     * Shutdown this filter,
     * e.g. save persistent state, free resources.
     */
    public void close();
    }

Roedy Green

Sep 22, 2003, 4:04:34 PM

On Mon, 22 Sep 2003 19:28:42 +0100, Neil Campbell <ne...@nospam.com>
wrote or quoted :

>In my opinion, you'd have to think about what your system would provide that
>similar tools don't.

SpamDetective : would allow large numbers of messages which it does
not.

K9, SpamBayes : would let you use it with Eudora which K9 does not.

MailWasher : would let you use it with large numbers of messages which
it does not.

various server based solutions: let you use it without co-operation of
your ISP or server admin folk.

Vipul's Razor: easier to install and configure, if you just used
canned filters. Perhaps someone could even build a filter that used
the Razor protocol.

SaProxy: uses 80 MB RAM. Presumably we could do better with JET
compilation.

Bogofilter: C source only, does not run on Windows.

The key thing is the ability to whip up your own little filter to nail
your own particular problem using your familiar Java tools.

Mark Thornton

Sep 22, 2003, 4:07:42 PM

To be effective in the current circumstances, the filter needs to
actually run on the server so that it can work while your client
computer is switched off or disconnected from the net. My ISP limits the
size of my mailbox on their server to 10MB, thus with >1100 messages in
the past 12 hours (~165MB) the box would have filled many times over if
my machine had not been continuously collecting (and filtering) them.

In some cases a server based filter might be able to reject a message
before the entire message had been received (e.g. based on the title or
when a .exe attachment is encountered).

Mark Thornton

P.S. Of course I can't compete with the 5GB you received, but this is a
competition I would rather not be in.

Neil Campbell

Sep 22, 2003, 8:18:33 PM

Roedy Green wrote:

> The key thing is the ability to whip up your own little filter to nail
> your own particular problem using your familiar Java tools.

Fair enough, but in my experience these sorts of problems are those in which
you want to block all messages with a particular phrase in the subject
line, or messages from a particular domain. In these cases, the mail
client often provides enough functionality to deal with it.

The more complex cases of blocking spam in general are difficult to deal
with using user-written filters. Tools like Popfile go some way to
stopping these, and implementing similar tools again would be
time-consuming at best.

I agree totally with the validity of permitting this sort of filtering to be
done at the client; it is usually impractical to persuade an ISP to
implement something useful.

Please don't interpret these comments as negative; I think the project is
definitely a worthwhile one. I simply feel that the sort of 'little
filters' that could be easily written for this would be somewhat limited.
If this is taken further, however, I'd love to add support for it to my
mail client (which I'm gradually progressing towards a workable release).

Wojtek

Sep 22, 2003, 10:58:22 PM

On Mon, 22 Sep 2003 14:23:38 GMT, Roedy Green <ro...@mindprod.com>
wrote:

>On Mon, 22 Sep 2003 04:43:40 GMT, Wojtek <su-...@bossi.com> wrote or
>quoted :
>
>>Or does it interpose itself between the email client and the IP stack?
>>If so, it would be best placed on my firewall, where it would act for
>>all my internal machines.
>
>There are two approaches I have seen used.
>

>The other technique is to implement the filter as an email proxy
>server. You then have your email program talk to localhost:9999
>instead of the regular mailserver.

That makes sense....

>The K9 people did this, but only
>on the pop3 side. Eudora thus does not work since you can't configure
>the SMTP side independently.

Really? That's strange. My ISP has SMTP for outgoing, yet I get my
email from my domain via POP3. My domain provider does not allow SMTP
unless it comes from their own network (dialup accounts).

>To deal with spam you can either delete it on the server, or mark it
>specially e.g. put [Spam] in subject line, where it is easy for the
>mail program to filter it.
>
>For this virus-generated stuff where the messages themselves are
>fairly fat with an attachment, it makes sense to delete on the server
>without downloading the whole thing.

So you grab the first X bytes via what? I would think FTP, or can POP3
do this?

Wojtek

Sep 22, 2003, 11:00:00 PM

On Mon, 22 Sep 2003 14:29:52 GMT, Roedy Green <ro...@mindprod.com>
wrote:

>On Mon, 22 Sep 2003 04:43:40 GMT, Wojtek <su-...@bossi.com> wrote or
>quoted :
>
>>Besides my own domain, I am also the webmaster for another. That one
>>has several people who get email on it. How would this work for that
>>domain (I cannot run this on the actual mail server as it is owned by
>>the ISP)? All those users use Outlook.
>
>I was thinking of something that ran client side, since it would talk
>POP3. However, you could run it on the server, just talking to the
>mailserver locally.
>
>Probably more efficient though would be to write a different framework
>that ran the same filters (not as tight as for individuals) to filter
>all mail coming into an ISP.
>
>The main idea of the program is that it is user-extensible with
>arbitrarily complicated Java code. It would be fairly trivial to add
>a blacklist ISP filter, a friends/enemies filter. All the work does
>not fall on one person. You don't have the political problem of
>convincing the author your style of filter is important or explaining
>just how your filter should work.

Ok, then the app would have all filters, then using a config file
(XML?) you would configure the filters and which ones were live.

Wojtek

Sep 22, 2003, 11:04:29 PM

On Mon, 22 Sep 2003 14:34:09 GMT, Roedy Green <ro...@mindprod.com>
wrote:

>On Mon, 22 Sep 2003 04:43:40 GMT, Wojtek <su-...@bossi.com> wrote or
>quoted :
>
>>. But it does not have "memory" nor the
>>capability to compare all the emails as a group for patterns.
>
>This is trickier.
>
>The framework could ask the filter if it wanted to be called twice.
>It would get a chance to look at all the incoming new mail first, then
>on the second pass make its final decision.
>
>Or perhaps it would only get one look at each new piece, but it would
>be at liberty to maintain its own persistent internal state.

I think persistent storage. The greater the sample, the more accurate
the analysis.

Maybe even a central server with the signatures? Hmm, I think this has
been done already. But we can do it "better" :-))

Each filter would (should?) have the option of saving some state
information. If nothing else a simple hit count. Either through its
own code, or using the framework's classes.
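
A sketch of the "through its own code" option: a filter that remembers normalised subject lines between runs and keeps a hit count for each. `SubjectMemory` is a hypothetical name, and using a java.util.Properties file as the on-disk format is just one convenient choice, not something anyone in the thread proposed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Properties;

final class SubjectMemory {
    private final Properties counts = new Properties();
    private final Path file;

    /** Restores earlier counts if the file exists; otherwise starts empty. */
    SubjectMemory(Path file) {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                counts.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Collapses case, digits and punctuation so near-identical subjects share one key. */
    static String normalize(String subject) {
        return subject.toLowerCase(Locale.ROOT).replaceAll("[^a-z]+", " ").trim();
    }

    /** Records one sighting and returns how often this pattern has now been seen. */
    int record(String subject) {
        String key = normalize(subject);
        int seen = Integer.parseInt(counts.getProperty(key, "0")) + 1;
        counts.setProperty(key, Integer.toString(seen));
        return seen;
    }

    /** Writes the counts back so the next run can pick them up. */
    void save() {
        try (OutputStream out = Files.newOutputStream(file)) {
            counts.store(out, "subject pattern counts");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A filter could call `record` for each incoming subject and start returning a high spam probability once the count passes some threshold of its choosing.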

Wojtek

Sep 22, 2003, 11:08:20 PM

On Mon, 22 Sep 2003 14:29:52 GMT, Roedy Green <ro...@mindprod.com>
wrote:

>The main idea of the program is that it is user-extensible with
>arbitrarily complicated Java code. It would be fairly trivial to add
>a blacklist ISP filter, a friends/enemies filter.

Each filter could return a probability. The framework could then
tally the probabilities and assign a weight to the email. If the
weight value surpasses some amount (or a filter returns some
significant probability), then the email is SPAM, not the real piggy
stuff (love the SPAM <-> HAM usage :-)).
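
A minimal sketch of that tally. It assumes the "weight" is the plain mean of the filter scores and that a single sufficiently confident filter can settle the matter on its own. `Tally` and both threshold parameters are hypothetical.

```java
final class Tally {
    /**
     * Returns true when the mean score exceeds meanThreshold,
     * or when any single score reaches certainThreshold.
     */
    static boolean isSpam(float[] scores, float meanThreshold, float certainThreshold) {
        if (scores.length == 0) {
            return false;  // no filter ran, so there is no evidence either way
        }
        float sum = 0f;
        for (float s : scores) {
            if (s >= certainThreshold) {
                return true;  // one filter is sure enough by itself
            }
            sum += s;
        }
        return sum / scores.length > meanThreshold;
    }
}
```

For example, `Tally.isSpam(new float[] {0.1f, 0.2f, 0.99f}, 0.65f, 0.95f)` is true because of the last score alone, even though the mean is well under the threshold.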

Roedy Green

Sep 23, 2003, 3:12:10 PM

On Tue, 23 Sep 2003 02:58:22 GMT, Wojtek <su-...@bossi.com> wrote or
quoted :

>So you grab the first X bytes via what? I would think FTP, or can POP3
>do this?

I know SpamDetective does this. I gather it just starts the read and
aborts. In JavaMail, the reading is transparent. You don't really know
when the I/O goes on unless you monitor traffic to figure out
how it works. It may thus be harder in JavaMail to avoid downloading
more than you need.
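
For what it is worth, POP3 itself (RFC 1939) defines an optional TOP command that returns the headers plus the first n lines of the body, so a server that supports it can answer the "first X bytes" question without FTP. Whether SpamDetective uses it is not known here. The sketch below only builds the command and parses a reply from any reader; `Pop3Top` is a hypothetical helper and the socket handling is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class Pop3Top {
    /** Builds the TOP command: message number, then how many body lines to send. */
    static String command(int messageNumber, int bodyLines) {
        return "TOP " + messageNumber + " " + bodyLines + "\r\n";
    }

    /**
     * Reads a multi-line POP3 response: a status line, then lines up to a lone ".".
     * A leading "." on any other line is byte-stuffing and is removed.
     */
    static List<String> readResponse(BufferedReader in) {
        try {
            String status = in.readLine();
            if (status == null || !status.startsWith("+OK")) {
                throw new IllegalStateException("TOP refused: " + status);
            }
            List<String> lines = new ArrayList<>();
            String line = in.readLine();
            while (line != null && !line.equals(".")) {
                lines.add(line.startsWith(".") ? line.substring(1) : line);
                line = in.readLine();
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In real use the reader would wrap the socket's input stream, and a server that does not implement TOP would answer with an -ERR status line instead.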

Roedy Green

Sep 23, 2003, 3:12:11 PM

On Mon, 22 Sep 2003 21:07:42 +0100, Mark Thornton
<m.p.th...@ntl-spam-world.com> wrote or quoted :

>
>P.s. Of course I can't compete with the 5GB you received, but this a
>competition I would rather not be in.

The ISP has to pay for the bandwidth of all this crud, at about 50K
each. It has to be stopped even before it arrives.
That's why I have for now a semi secret email account visible to
humans but not to most robots on my site.

Roedy Green

Sep 23, 2003, 3:12:11 PM

On Tue, 23 Sep 2003 03:00:00 GMT, Wojtek <su-...@bossi.com> wrote or
quoted :

>Ok, then the app would have all filters, then using a config file
>(XML?) you would configure the filters and which ones were live.

Given this is a tool for Java programmers, you might just write a
piece of Java code that listed the filters you wanted in the order you
wanted. I would want to avoid Class.forName, to give native code
optimisers the best possible shot.

GaryM

Sep 23, 2003, 5:22:48 PM

Roedy Green <ro...@seewebsite.com> wrote in
news:rk31nv8s4jnv2c4bt...@4ax.com:

> You don't know
> really when the i/o goes on unless you monitor traffic to figure out
> how it works. It may thus be harder in JavaMail to avoid downloading
> more than you need.
>

The Message will only get header info unless you ask for the body.
Been there, sniffed that, if you get my meaning.

Roedy Green

Sep 23, 2003, 5:49:35 PM

On Tue, 23 Sep 2003 21:22:48 -0000, GaryM <gar...@yahoo.com> wrote or
quoted :

>
>The Message will be only get header info unless you ask for the body.
>Been there, sniffed that, if you get my meaning.

There are intermediate levels between just header and whole thing that
are useful in spam detection, namely:

just the first X characters of the body.

Just the body without the attachments

GaryM

Sep 23, 2003, 6:50:48 PM

Roedy Green <ro...@seewebsite.com> wrote in
news:utf1nvod20ts88ffh...@4ax.com:

>
> There are intermediate levels between just header and whole thing
> that are useful in spam detection, namely:
>
> just the first X characters of the body.

I think that if I am at the body level doing a lexical analysis,
then the whole body will always perform better than anything less.
Don't forget the numero uno spam feature, to wit, The Unsubscribe
Message and all of its permutations, is always near the end.

>
> Just the body without the attachments
> .

In a multipart MIME message there is no distinction between 'body' and
'attachment'. These are all parts, and you can make an intelligent guess
by looking at the MIME type and its disposition. Sadly there are no
guarantees where any will occur, so you must parse them all or parse
until you meet an assumption (like the first text/* part is the one I
want to analyze).

IMHO, body checks are definitely the most expensive and you can obtain
excellent performance without them, but they are useful as a last
resort. If not using JavaMail, then you can just read the stream and
abort when you've seen enough, but you will need to decipher MIME
boundaries on the fly and decode base64, quoted-printable, etc.
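
To give a feel for the decoding half of that work, here is a small sketch of the two transfer encodings named above. `TransferDecoding` is a hypothetical helper, and it assumes the decoded bytes are UTF-8, whereas a real message declares its charset in the part's Content-Type header.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class TransferDecoding {
    /** Decodes a base64 body; the MIME decoder skips the line breaks that mail inserts. */
    static String base64(String encoded) {
        return new String(Base64.getMimeDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    /** Decodes quoted-printable: "=XX" is one byte in hex, "=" before a line break joins lines. */
    static String quotedPrintable(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '=' && encoded.startsWith("\r\n", i + 1)) {
                i += 3;  // soft line break (CRLF)
            } else if (c == '=' && encoded.startsWith("\n", i + 1)) {
                i += 2;  // soft line break (bare LF)
            } else if (c == '=' && i + 2 < encoded.length()
                    && Character.digit(encoded.charAt(i + 1), 16) >= 0
                    && Character.digit(encoded.charAt(i + 2), 16) >= 0) {
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.write(c);  // literal character, assumed to be ASCII
                i++;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Finding the MIME boundaries and choosing which decoder to apply per part is the harder job, and this sketch does not attempt it.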

Some other pointers with mail body checks are:

Embedded RFC822 messages, which if multipart require recursion to
parse. Here JavaMail is cumbersome but can be wrapped easily to do the
job.

Strip HTML or not? You may have seen, Her<fhjhdfjhd>bal remedy, which
is rendered as Herbal in all mail clients that render html. In general
it is best to strip html, but sometimes the URLs in the body are more
indicting than the domains in the headers.
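
A rough sketch of both halves of that advice: pull the link targets out first, then strip the tags so that the split word reads normally. The regular expressions are deliberately naive and `HtmlBody` is a hypothetical helper; a real HTML parser would cope better with malformed markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlBody {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Removes anything that looks like a tag, so "Her<fhjhdfjhd>bal" reads "Herbal". */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    /** Collects quoted href targets; call this before the tags are thrown away. */
    static List<String> links(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

Keeping the two results separate lets one filter judge the visible text while another judges the domains being advertised.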

Just a few thoughts,

Gary

Roedy Green

Sep 26, 2003, 6:15:26 PM

On Fri, 26 Sep 2003 15:20:36 -0500, brou...@yahoo.com wrote or
quoted :

>What's the difference between 50/50 odds and no opinion?

Let's say you filtered and kept computing filters until the average
went either below .1 or above .9. 50-50 adds uncertainty to the
moving average, pulling it away from either end. "no opinion" has no
effect on the average.
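
A small sketch of that difference. It walks the filter scores in order, skips -1 abstentions, and stops as soon as the running average leaves the band between the two limits. `RunningVerdict` and its parameter names are hypothetical.

```java
final class RunningVerdict {
    static final float NO_OPINION = -1f;

    /**
     * Returns the running average at the point where it first drops below low
     * or rises above high, the overall average if it never does,
     * or NO_OPINION if every filter abstained.
     */
    static float average(float[] scores, float low, float high) {
        float sum = 0f;
        int votes = 0;
        for (float s : scores) {
            if (s == NO_OPINION) {
                continue;  // abstentions leave the average untouched
            }
            sum += s;
            votes++;
            float avg = sum / votes;
            if (avg < low || avg > high) {
                return avg;  // confident enough: later filters are not consulted
            }
        }
        return votes == 0 ? NO_OPINION : sum / votes;
    }
}
```

With limits of 0.1 and 0.9, the scores 0.8 then 0.5 average to about 0.65, while 0.8 then -1 stay at 0.8: the 50-50 vote drags the result toward the middle and the abstention does not.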

Wojtek

Sep 27, 2003, 12:12:08 AM

On Mon, 22 Sep 2003 19:56:27 GMT, Roedy Green <ro...@mindprod.com>
wrote:

>On Mon, 22 Sep 2003 16:58:06 GMT, David Segall <da...@segall.net>
>wrote or quoted :
>

>Here is a first cut at the interface for spam filters:

And if we want the filters to have some sort of storage capability:

>public interface SpamDetect
> {
public void setStorage( SpamStorage storage );
> }

Where SpamStorage is a concrete class:
- initialized by the framework
- reference kept in a List
- contains a "key" generated by the framework to uniquely identify
this filter (for a separate table?)
- uses an interface to a database layer

The filter can ignore or use the storage as it wishes. The framework
will be responsible for cleanup.

Roedy Green

Sep 27, 2003, 2:28:53 PM

On Sat, 27 Sep 2003 04:12:08 GMT, Wojtek <su-...@bossi.com> wrote or
quoted :

>The filter can ignore or use the storage as it wishes. The framework
>will be responsible for cleanup.

What would be the problem with just persisting to little serialised
files?

Wojtek

Sep 27, 2003, 2:45:38 PM

On Sat, 27 Sep 2003 18:28:53 GMT, Roedy Green <ro...@seewebsite.com>
wrote:

>On Sat, 27 Sep 2003 04:12:08 GMT, Wojtek <su-...@bossi.com> wrote or
>quoted :
>
>>The filter can ignore or use the storage as it wishes. The framework
>>will be responsible for cleanup.
>
>What would be the problem with just persisting to little serialised
>files?

That would mean that each filter that needed storage would have to
manage its own storage IO. Duplication of effort.

That is what frameworks are for. To provide common services to
processes. You would not want each servlet (in a Web app) to have its
own logging code.
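
The two positions are not far apart. A sketch of a framework-owned storage class that happens to persist to little serialised files, one per filter key, so that filters never do their own I/O. This `SpamStorage` is a guess at the shape of the class proposed above, not an agreed design: the int-valued map, the `.ser` file naming, and the method names are all assumptions, and a database layer could sit behind the same methods.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

/** Hypothetical take on SpamStorage: one serialised map per filter key. */
final class SpamStorage {
    private final Path file;
    private HashMap<String, Integer> values = new HashMap<>();

    /** The framework supplies the directory and a key that is unique per filter. */
    SpamStorage(Path directory, String filterKey) {
        this.file = directory.resolve(filterKey + ".ser");
        if (Files.exists(file)) {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                @SuppressWarnings("unchecked")
                HashMap<String, Integer> loaded = (HashMap<String, Integer>) in.readObject();
                values = loaded;
            } catch (IOException | ClassNotFoundException e) {
                throw new IllegalStateException("Cannot read " + file, e);
            }
        }
    }

    /** Returns the stored value, or 0 if the filter has never set it. */
    int get(String name) {
        return values.getOrDefault(name, 0);
    }

    void put(String name, int value) {
        values.put(name, value);
    }

    /** Called by the framework at shutdown, so filters never touch the file themselves. */
    void flush() {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(values);
        } catch (IOException e) {
            throw new IllegalStateException("Cannot write " + file, e);
        }
    }
}
```

A filter handed this object through setStorage could keep its hit count with `storage.put("hits", storage.get("hits") + 1)` and leave loading and saving to the framework.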