OT: spam filtering idea

Paul Rubin

Jan 13, 2003, 10:26:00 AM

Yeah there are better newsgroups for it, but this is where I hang out
and the subject has come up here. I just thought of it after seeing
the "join our community" thing here on c.l.py.

Anyway, I wonder if the following is a worthwhile hack to improve
Bayesian filtering. Maybe it's already being done--I haven't seen it
done quite this way, but I'm not a spam filtering guru.

The idea is to run the probability coefficients through a digital
filter, so the probabilities decay over time. That is, you give
special emphasis to words found in RECENTLY RECEIVED spam. If you get
message M with the words "banana", "elephant", and "doorknob", that
doesn't make M especially likely to be spam. But if you got a
piece of spam YESTERDAY with that combination of words, then M is
almost certainly also spam. That lets you crank up the probabilities
for newly arrived spam words to considerably higher levels than you'd
trust in a quasi-static corpus (keep "elephant" high on your list for
too long and it may create false positives later).
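
A minimal sketch of the decay idea, assuming a simple exponential half-life on per-word spam counts. The class name, the 24-hour half-life, and the use of hours as the time unit are all assumptions made for illustration, not something taken from an existing filter:

```python
import math
from collections import defaultdict


class DecayingSpamCounts:
    """Per-word spam weights that fade with an exponential half-life."""

    def __init__(self, half_life_hours=24.0):
        # Decay constant chosen so a weight halves every half_life_hours.
        self.rate = math.log(2) / half_life_hours
        self.weight = defaultdict(float)  # word -> decayed count
        self.stamp = {}                   # word -> time of last update (hours)

    def _decayed(self, word, now):
        last = self.stamp.get(word, now)
        return self.weight[word] * math.exp(-self.rate * (now - last))

    def add_spam(self, words, now):
        """Record one spam message seen at time `now` (in hours)."""
        for word in set(words):
            self.weight[word] = self._decayed(word, now) + 1.0
            self.stamp[word] = now

    def recent_weight(self, word, now):
        """Current decayed weight; 0.0 for words never seen in spam."""
        if word not in self.stamp:
            return 0.0
        return self._decayed(word, now)
```

A word seen in yesterday's spam keeps half its weight today and a quarter tomorrow, so "elephant" stops counting against legitimate mail without anyone having to prune the list by hand.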

The next step is to collect the frequency statistics at various
honeypots around the net, automatically combining them and transferring
them to public databases. Your filter can then retrieve new
statistics over the net every few hours. Any spam you receive will
probably also hit a honeypot at about the same time that you get it.
So since the statistics you've retrieved are weighted for the latest
and freshest spam, you should be able to kill it very effectively.
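
Combining statistics from several honeypots could be as simple as summing counts. This is only a sketch: the report format, a (message_count, {word: count}) pair, is a made-up assumption, and a real public database would need some way to authenticate and weight its contributors:

```python
from collections import Counter


def merge_honeypot_counts(reports):
    """Sum per-word spam counts and message totals from several honeypots.

    Each report is a (message_count, {word: count}) pair.
    """
    total_messages = 0
    merged = Counter()
    for message_count, word_counts in reports:
        total_messages += message_count
        merged.update(word_counts)  # Counter.update adds counts together
    return total_messages, merged
```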

In case you get the spam faster than the honeypots do, you may not
want to immediately Bayes-filter all incoming mail into spam- and
non-spam folders. Instead, you'd only immediately deliver mail from
addresses on your whitelist. Anything else, you'd hold for, say, 6
hours, then run it through the Bayes filter for categorization. Since
spam tends to be sent out in batches a few hours long, that delay
should be enough for the honeypots to receive it and update the
databases.
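
The whitelist-then-hold policy above might be sketched like this. The function name and the three return values are hypothetical, and times are plain seconds so the logic is easy to test:

```python
HOLD_SECONDS = 6 * 60 * 60  # the six-hour delay suggested above


def triage(sender, arrived_at, now, whitelist):
    """Return 'deliver', 'hold', or 'classify' for one message."""
    if sender.lower() in whitelist:
        return "deliver"   # known correspondents skip the delay
    if now - arrived_at < HOLD_SECONDS:
        return "hold"      # wait for fresher statistics to arrive
    return "classify"      # now hand it to the Bayes filter
```

The whitelist is assumed to hold lower-cased addresses. Anything that comes back as "classify" would then be scored against whatever statistics were fetched most recently.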

Thoughts?

Feel free to crosspost replies to an anti-spam newsgroup; I don't know
which one to use.

Skip Montanaro

Jan 13, 2003, 10:39:19 AM

Paul> Feel free to crosspost replies to an anti-spam newsgroup; I don't
Paul> know which one to use.

Paul,

You might check out <http://spambayes.sf.net/>. The corresponding mailing
list is spam...@python.org.

Skip

Peter Hansen

Jan 13, 2003, 11:40:03 AM

Paul Rubin wrote:
>
> Yeah there are better newsgroups for it, but this is where I hang out
> and the subject has come up here. I just thought of it after seeing
> the "join our community" thing here on c.l.py.

> [snip ideas]
> Thoughts?

I've been thinking for some time that this is a perfect area for genetic
algorithms. They will automatically adapt to changing conditions just
as your low-pass filter concept would, but I suspect the number of false
positives and negatives could eventually be quite low.

Of course, upgrading the Internet to require authentication in email
would probably be an even more effective approach...

-Peter

Hans Nowak

Jan 13, 2003, 12:41:56 PM

Paul Rubin wrote:

> The idea is to run the probability coefficients through a digital
> filter, so the probabilities decay over time. That is, you give
> special emphasis to words found in RECENTLY RECEIVED spam. If you get
> message M with the words "banana", "elephant", and "doorknob", that
> doesn't make M especially likely to be spam.

Maybe in the future it will be, when all conventional spam is caught by spam
filters, and certain words cannot be used anymore... "Is your banana the size
of a doorknob? Use <brand X> to make it the size of an elephant!" :-)

--
Hans (base64.decodestring('d3VybXlAZWFydGhsaW5rLm5ldA=='))
# decode for email address ;-)
The Pythonic Quarter:: http://www.awaretek.com/nowak/
Kaa:: http://www.awaretek.com/nowak/kaa.html

Paul Wright

Jan 13, 2003, 3:07:45 PM

In article <7xr8bhh...@ruckus.brouhaha.com>,

Paul Rubin <phr-n...@NOSPAMnightsong.com> wrote:
>The next step is to collect the frequency statistics at various
>honeypots around the net, automatically combining them and transferring
>them to public databases. Your filter can then retrieve new
>statistics over the net every few hours. Any spam you receive will
>probably also hit a honeypot at about the same time that you get it.
>So since the statistics you've retrieved are weighted for the latest
>and freshest spam, you should be able to kill it very effectively.

I thought the point of Bayesian filtering was that it learned about your
spam and your legitimate email, so that learning what other people
considered spam wouldn't be as effective. I'm no expert on this, though.
I expect Tim Peters will be along in a minute :-)

<http://www.jerf.org/irights/2002/11/18.html> argues that human malice
can and will defeat Bayesian filters, and that widespread adoption of
them will end up making spam harder to recognize by hand. I'm a little
concerned that the author of the article overestimates the intelligence
of spammers, but I suppose there's a selection pressure on them to get
more cunning as time goes on. The people who successfully spam my
Hotmail spam trap these days are certainly getting cleverer, presumably
in response to Brightmail filtering.

A system which works by reporting mail to honeypots would be better off
reporting hashes of message bodies to something like Vipul's Razor or
the Distributed Checksum Clearinghouse. That said, the obvious spammer
response when people do that is to make messages which are more and more
dissimilar for each recipient, again something where human malice can
probably defeat automated attempts to find similar messages. The DCC's
creator has said that he thinks that it will eventually be most useful
against "mainsleaze", that is, spam from big businesses who will not
want to use the sort of filter-evading tactics which are popular with
the "enlarge your naked cheerleaders"[1] crowd.
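
To illustrate why per-recipient variation defeats exact body hashing, here is a small sketch using plain SHA-256 over a whitespace- and case-normalized body. This is an assumption for illustration only; Vipul's Razor and the DCC use their own checksum schemes, which this does not reproduce:

```python
import hashlib


def body_digest(body):
    """Hash a message body after collapsing whitespace and case."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = "Enlarge your savings today!  Visit http://example.com/"
reflowed = "enlarge your savings today!\nvisit http://example.com/"
busted = original + " xqzv7731"  # random per-recipient suffix

# Normalization survives reflowing and case changes...
same = body_digest(original) == body_digest(reflowed)
# ...but any random filler changes the digest completely.
different = body_digest(original) != body_digest(busted)
```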

[1] How many boneheaded keyword filters will now bounce this post when
it goes out as mail on the python list, I wonder? There's an awful lot
of snake oil out there being sold as spam filters.

--
Paul Wright | http://pobox.com/~pw201 |

Tim Peters

Jan 13, 2003, 3:42:35 PM

[Paul Wright]

> I thought the point of Bayesian filtering was that it learned about your
> spam and your legitimate email, so that learning what other people
> considered spam wouldn't be as effective. I'm no expert on this, though.
> I expect Tim Peters will be along in a minute :-)

Cursed to do so, yes <wink>. Trying to train one of these classifiers to
serve a diverse group of users at once is demonstrably and quantifiably much
less effective. Training one for a focused mailing list (like
comp.lang.python) appears extremely effective, though (the spambayes
classifier had error rates too low to measure reliably across a 20K c.l.py
ham + 14K spam test).

> <http://www.jerf.org/irights/2002/11/18.html> argues that human malice
> can and will defeat Bayesian filters, and that widespread adoption of
> them will end up making spam harder to recognize by hand. I'm a little
> concerned that the author of the article overestimates the intelligence
> of spammers, but I suppose there's a selection pressure on them to get
> more cunning as time goes on.

He's probably right that the way to beat this generation of filters is to
create spam statistically indistinguishable from ham. The unknown not
addressed there is that all forms of advertising are a percentage game, and
current spam uses (e.g.) ALL CAPS and huge fonts and bright colors because
those tricks increase response rate. Spam so bland that it looks like it
came from your grandmother may not draw a response rate large enough to
repay the costs of spamming (which, while tiny on a per-msg basis, aren't
zero).

> ...

> [1] How many boneheaded keyword filters will now bounce this post when
> it goes out as mail on the python list, I wonder? There's an awful lot
> of snake oil out there being sold as spam filters.

I received it via the c.l.py mailing list gateway. My personal spambayes
filter gave your msg a score of 0.9998 for haminess (0.0 = spam, 1.0 = ham),
and 1.422e-07 for spaminess (0.0 = ham, 1.0 = spam). So, overall, it was
certain the msg was ham, and despite that it contained some mildly strong
spam words:

'intelligence' 0.958257
'legitimate' 0.961219
'businesses' 0.963789

An individual word can neither damn nor redeem a msg in this kind of
approach; it's more a "preponderance of evidence" approach. Effective spam
has to get around to selling you something, and the language of advertising
as a whole is very damning (for example, in c.l.py tests, the putative ham
we had the most trouble with was conference announcements: they're trying
to sell you on attending a conference, and share much language with hardcore
spam because of it).
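
The "preponderance of evidence" point can be seen with a sketch of one way to combine per-word probabilities: Fisher-style chi-squared combining of the kind Gary Robinson has described. Whether this matches the spambayes code line for line is an assumption; the function names are made up for the example:

```python
import math


def chi2q(x2, dof):
    """Upper tail of the chi-squared distribution for even `dof`."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    spam_evidence = 1.0 - chi2q(
        -2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(
        -2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0
```

Feeding it the three spammy words listed above together with twenty hammy words at 0.05 each (invented values) gives a score very close to 0, so a few strong spam clues do not outweigh the rest of the message.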

je...@compy.attbi.com

Jan 13, 2003, 7:39:12 PM

On Mon, 13 Jan 2003 15:42:35 -0500, Tim Peters wrote:
> He's probably right that the way to beat this generation of filters is to
> create spam statistically indistinguishable from ham. The unknown not
> addressed there is that all forms of advertising are a percentage game, and
> current spam uses (e.g.) ALL CAPS and huge fonts and bright colors because
> those tricks increase response rate. Spam so bland that it looks like it
> came from your grandmother may not draw a response rate large enough to
> repay the costs of spamming (which, while tiny on a per-msg basis, aren't
> zero).

I couldn't think of a reasonable way to predict the results of that,
because as I think I mentioned in another posting, there are two big
unknowns: The nature of the people responding to the spams (have you ever
really thought about it? who the hell is keeping these things afloat? In
all seriousness, my current theory is that we're talking people of reduced
intelligence, but I don't *know*.), and how close the spam industry may be
to economic collapse, such that Bayes-type filters (which *are*
legitimately better than previous approaches) may be enough to tip them
over the edge. Without more data about those two things it's hard to
predict what will happen if spam tones down.

Somewhat back on the Python topic, once SpamBayes is done I intend to see
if I can implement what I talked about. It's just not worth picking up an
implementation in another language when it'd probably be a small handful
of hours' work in Python...

Tim Peters

Jan 13, 2003, 8:42:16 PM

[je...@compy.attbi.com]

> I couldn't think of a reasonable way to predict the results of that,
> because as I think I mentioned in another posting, there are two big
> unknowns: The nature of the people responding to the spams (have you
> ever really thought about it? who the hell is keeping these things
> afloat? In all seriousness, my current theory is that we're talking
> people of reduced intelligence, but I don't *know*.), and how close the
> spam industry may be to economic collapse, such that Bayes-type filters
> (which *are* legitimately better than previous approaches) may be enough
> to tip them over the edge. Without more data about those two things it's
> hard to predict what will happen if spam tones down.

Well, I've noted before (but on the spambayes mailing list) that I expect
widespread adoption of this kind of classifier may actually increase spam,
while *not* toning it down at all. The thing is that the system has no
predefined notions of "ham" or "spam": it believes whatever you train it to
believe. For example, I get a particular class of "Joke of the Day" spam,
which I sometimes enjoy. My personal classifier is trained to consider that
ham, and despite that the rest of such a msg hawks everything from human
growth hormone to cheap ink jet cartridges (and it came as a surprise just
how fine are the distinctions the classifier can make). OTOH, there are
some kinds of email I get from companies I do business with that I'd rather not
be bothered with, and the system calls those spam now.

Now supposing I really want porn spam, and the raunchier the better, it's
easy to train a classifier to call such stuff ham. If this filter
technology reaches enough people that the fraction of a fraction of a
percent of those who really want porn spam get hold of it, they won't miss
porn spam anymore in the blizzard of spams they don't want to see, and
response rates for porn spammers may well go *up*. But it will be in the
interest of the porn spammers then not to try to disguise the nature of
their msg; to the contrary, it will be in their interest to have it SCREAM
"porn spam".

Substitute get-rich-quick, or penis enlargement, or what have you.

> Somewhat back on the Python topic, once SpamBayes is done I intend to see
> if I can implement what I talked about.

It's long been done enough for geeks to use effectively. The killers are
integrating with a gazillion quirky mail clients, and making a system so
easy to use that you don't have to learn anything to use it.

> It's just not worth picking up an implementation in another language
> when it'd probably be a small handful of hours' work in Python...

The classifier proper is very simple and brief Python code. The tokenizer
is hairier, but still not a major piece of work. The hairiest code by far
is Mark Hammond's wondrous integration code for Outlook 2000.

Paul Rubin

Jan 13, 2003, 10:13:51 PM

Tim Peters <tim...@comcast.net> writes:
> Cursed to do so, yes <wink>. Trying to train one of these classifiers to
> serve a diverse group of users at once is demonstrably and quantifiably much
> less effective.

Yes, the hope is you get some of the effectiveness back by giving
extra weight to words found in recently received spam. The
observation is that individual pieces of spam tend to circulate for fairly
short periods, so if you spot words from them during that period, that
tells you something even if the messages mutate (all the similar
Nigerian spams).

By the way, here's a hysterically funny variation ("urgent
counter-proposal") on the Nigerian spam:

http://www.nightsong.com/phr/urgent.txt

It's from comp.dcom.telecom and I saved it.

Michael Hudson

Jan 14, 2003, 7:32:00 AM

-$P-W$-@verence.demon.co.uk (Paul Wright) writes:

> That said, the obvious spammer
> response when people do that is to make messages which are more and more
> dissimilar for each recipient, again something where human malice can
> probably defeat automated attempts to find similar messages.

Two points spring to mind:

1) having to modify the message for each recipient makes the spammer's
job harder -- a good thing.

2) the extreme so-unlikely-it's-never-going-to-happen extension of
this line is that people only send me marketing material *I'm
actually interested in* -- which would also be a good thing.

> The DCC's creator has said that he thinks that it will eventually be
> most useful against "mainsleaze", that is, spam from big businesses
> who will not want to use the sort of filter-evading tactics which
> are popular with the "enlarge your naked cheerleaders"[1] crowd.

A reflection of this fact is that spammers are clearly testing their
mails against SpamAssassin, whereas some mailshots I get, and want to
get, (the Apple Developer Connection News being a striking example)
frequently get flagged by SA as spam.

I'd install spambayes, but the starship is so thoroughly & viciously
protected *anyway* that I only get a tiny amount of spam as it is...

Cheers,
M.

--
[Perl] combines all the worst aspects of C and Lisp: a billion
different sublanguages in one monolithic executable. It combines
the power of C with the readability of PostScript. -- Jamie Zawinski

Paul Wright

Jan 14, 2003, 8:27:49 AM

In article <mailman.104249066...@python.org>,
Tim Peters <tim...@comcast.net> wrote:
>[Paul Wright]

>> <http://www.jerf.org/irights/2002/11/18.html> argues that human malice
>> can and will defeat Bayesian filters, and that widespread adoption of
>> them will end up making spam harder to recognize by hand.

...

>He's probably right that the way to beat this generation of filters is to
>create spam statistically indistinguishable from ham. The unknown not
>addressed there is that all forms of advertising are a percentage game, and
>current spam uses (e.g.) ALL CAPS and huge fonts and bright colors because
>those tricks increase response rate. Spam so bland that it looks like it
>came from your grandmother may not draw a response rate large enough to
>repay the costs of spamming (which, while tiny on a per-msg basis, aren't
>zero).

Indeed. However, I am seeing a lot of "minimalist" spam which is
obviously intended to evade body filtering: usually just a URL and a
hashbuster. I imagine that they're banking on people being curious
enough to click the link. I'm planning on dealing with short spam like
this by looking up the website host IP in blacklists, but it's not quite
enough of a problem to worry about yet.
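
The blacklist lookup for a URL's host might be sketched as below, using the usual DNS blacklist convention of reversing the IPv4 octets and prepending them to the list's zone. The zone name in the usage note is a placeholder, not a real service, and resolving the hostname from the URL is left out:

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """True if the blacklist zone has an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, dnsbl_query_name("192.0.2.10", "dnsbl.example.org") builds the name "10.2.0.192.dnsbl.example.org"; is_listed then needs a working resolver and a real zone.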

>> [1] How many boneheaded keyword filters will now bounce this post when
>> it goes out as mail on the python list, I wonder? There's an awful lot
>> of snake oil out there being sold as spam filters.
>
>I received it via the c.l.py mailing list gateway. My personal spambayes
>filter gave your msg a score of 0.9998 for haminess (0.0 = spam, 1.0 = ham),

Indeed. I don't think the Bayesian stuff is snake oil. However, mailing
list operators often complain about broken filters which seem to operate on
single key phrases (such as "Viagra" or "my pictures") in isolation,
causing legitimate discussion to get filtered. Someone out there is
probably making money selling these filters to big business, alas. See
<http://groups.google.com/groups?selm=ahkddc%24b7a%241%40verence.demon.co.uk>

Anthony Baxter

Jan 14, 2003, 9:06:21 AM

>>> Michael Hudson wrote

> I'd install spambayes, but the starship is so thoroughly & viciously
> protected *anyway* that I only get a tiny amount of spam as it is...

One somewhat ironic point is that python.org is also very heavily
spam-proofed - this results in any spams that do slip through the
net getting a bunch of high ham-value clues (from received lines, &c).

On the other hand, the system _is_ doing its job - after all, the
presence of python.org headers indicates that it's less likely to be
spam... and it's rare that these headers alone will push the message
into the (extremely small :) window that I consider ham.

--
Anthony Baxter <ant...@interlink.com.au>
It's never too late to have a happy childhood.

Skip Montanaro

Jan 14, 2003, 11:22:20 AM

Skip> ... Recent case in point, lots of spam coming from
Skip> "b...@boss.com"...

Which I now learn is a new virus spreading through the Windows/Outlook
community.

S

Skip Montanaro

Jan 14, 2003, 11:11:58 AM

(I haven't the slightest idea if Paul's email address is valid. Sure looks
weird.)

Paul> Indeed. However, I am seeing a lot of "minimalist" spam which is
Paul> obviously intended to evade body filtering: usually just a URL and
Paul> a hashbuster. I imagine that they're banking on people being
Paul> curious enough to click the link. I'm planning on dealing with
Paul> short spam like this by looking up the website host IP in
Paul> blacklists, but it's not quite enough of a problem to worry about
Paul> yet.

Spambayes already looks at URLs. Minimalist url-containing spam such as you
mention tends to wind up "unsure" until I train on it. Recent case in
point, lots of spam coming from "b...@boss.com". Your message had nearly 20
url:* tokens in it according to Spambayes tokenizer (sorted here from hammy
to spammy):

'url:python-list': 0.01
'url:selm': 0.01
'url:listinfo': 0.02
'url:mailman': 0.02
'url:python': 0.02
'url:demon': 0.05
'url:org': 0.06
'url:groups': 0.09
'url:mail': 0.09
'url:google': 0.14
'url': 0.35
'url:com': 0.63
'url:html': 0.68
'url:www': 0.69
'url:co': 0.85
'url:pobox': 0.90
'url:18': 0.92
'url:11': 0.94

And it's faster (and probably more accurate) than consulting an off-site
oracle to boot.
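
For readers wondering where tokens like 'url:pobox' come from, here is a rough sketch of URL tokenizing: find each URL, emit a bare 'url' token, then emit one 'url:piece' token per fragment. The regular expressions are assumptions for illustration and are not the actual Spambayes tokenizer:

```python
import re

URL_RE = re.compile(r"https?://([^\s<>]+)", re.IGNORECASE)


def url_tokens(text):
    """Yield 'url' and 'url:piece' tokens for each URL found in the text."""
    for match in URL_RE.finditer(text):
        yield "url"
        # Split on anything that is not a letter, digit, '_' or '-'.
        for piece in re.split(r"[^A-Za-z0-9_-]+", match.group(1)):
            if piece:
                yield "url:" + piece.lower()
```

Each token then gets its own spam probability from training, which is how a bare URL can still carry a fair amount of evidence.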

I've been using Spambayes since before November 1 (my oldest .procmailrc
backup file). I see no false positives and a modest number of unsures and
false negatives. Much better than pre-Spambayes (which was SpamAssassin).
After the initial big training run (lots of both ham and spam), I have only
been training on unsure or incorrectly classified messages.

Skip

Paul Rubin

Jan 14, 2003, 12:18:11 PM

Skip Montanaro <sk...@pobox.com> writes:
> Spambayes already looks at URLs. Minimalist url-containing spam such as you
> mention tends to wind up "unsure" until I train on it. Recent case in
> point, lots of spam coming from "b...@boss.com". Your message had nearly 20
> url:* tokens in it according to Spambayes tokenizer (sorted here from hammy
> to spammy):

Does spambayes look at the charset? I get tons of spam in Korean
characters. Anything with charset="euc-kr" or "ks_c_5601-1987" etc.
is just about certainly spam.

Spambayes is already working better than spamassassin? Wow. I guess
I'll look into switching. It's seemed to me up till now that it really
takes a mixture of dynamic (Bayesian) and hand-coded (SA) filtering.
I've heard the next version of SA will incorporate Bayesian filtering
in addition to what it already does.

Skip Montanaro

Jan 14, 2003, 12:27:48 PM

Paul> Does spambayes look at the charset?

Yup:

'charset:us-ascii': 0.17
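
A token like that can be produced with the standard library's email package. This is a sketch of the idea rather than the Spambayes tokenizer itself; the function name is made up:

```python
from email import message_from_string


def charset_tokens(raw_message):
    """Return sorted 'charset:name' tokens for every declared charset."""
    msg = message_from_string(raw_message)
    # get_charsets() yields None for parts that declare no charset.
    return sorted({"charset:" + cs.lower()
                   for cs in msg.get_charsets() if cs})
```

A message declaring charset="euc-kr" would yield 'charset:euc-kr', and the classifier learns from training how spammy that token is for a given user.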

Skip

David Mertz

Jan 14, 2003, 2:50:53 PM

Paul Rubin <phr-n...@NOSPAMnightsong.com> wrote previously:

|Does spambayes look at the charset? I get tons of spam in Korean
|characters. Anything with charset="euc-kr" or "ks_c_5601-1987" etc.
|is just about certainly spam.

I have a custom filter setup on my machine. It's a bit cobbled together
with duct tape and string, so I'm not exactly advocating it. I should
probably start using spambayes, but I'd need to write some wrapper for
my particular use model.

What I do (in a Python script) is poll my POP3 mailbox intermittently,
and download the headers only. If I decide something is definitely spam
based on the headers, I send a delete command, and never need to
download the whole message (i.e. a large virus body) with my regular
mail client. I like this because the spam-killer script is completely
independent of which mail client I use.
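
The headers-only approach might look roughly like this with the standard poplib module. It is a sketch under assumptions, not David's actual script: `conn` is taken to be an already authenticated poplib.POP3 (or POP3_SSL) connection, and `looks_like_spam` is a caller-supplied predicate:

```python
from email.parser import BytesParser


def purge_by_headers(conn, looks_like_spam):
    """Delete messages whose headers alone mark them as spam.

    `conn` needs the stat(), top() and dele() methods of poplib.POP3;
    `looks_like_spam` takes a parsed header block and returns True or False.
    """
    deleted = []
    count, _size = conn.stat()
    for number in range(1, count + 1):
        # TOP n 0 asks the server for the headers and zero body lines.
        _resp, lines, _octets = conn.top(number, 0)
        headers = BytesParser().parsebytes(b"\r\n".join(lines),
                                           headersonly=True)
        if looks_like_spam(headers):
            conn.dele(number)  # marked now, removed when the session quits
            deleted.append(number)
    return deleted
```

Because only headers cross the wire, a large virus attachment is never downloaded, and the regular mail client simply never sees the deleted messages.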

I analyze the headers twice. The first time looks for some values that
I manually entered, specific to header fields (e.g. "URGENT ASSISTANCE"
in the Subject:). Mostly I just started using this crude style first,
and didn't remove it. But then I make a second pass using a
pseudo-Bayesian analysis of the *trigrams* in the header. I think
trigrams work nicely for headers, which contain distinctive substrings,
but not so many whole words. I wrote about this a bit at:

http://www-106.ibm.com/developerworks/linux/library/l-spamf.html
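
For anyone unfamiliar with the term, trigrams here are just the overlapping three-character substrings of the header text. A minimal sketch (the function name is an assumption):

```python
def trigrams(text):
    """Return the overlapping three-character substrings of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]
```

So a charset string such as "euc-kr" contributes "euc", "uc-", "c-k" and "-kr", each of which can be counted in spam and ham headers like a word would be.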

One thing I look for in the first pass is several of those east Asian
charset strings. The way I figure it, even though I might get perfectly
welcome mail from Korean correspondents, if they are encoded in Korean,
I can't read them anyway. Of course, some people *do* read Korean (or
Chinese, Japanese, etc), so this filter clearly wouldn't work for them.

I've noticed, however, that the manual filters are usually redundant.
Almost everything that the patterns I hand coded catch is then also
caught by the trigram-bayes style.

Yours, David...

--
mertz@ | The specter of free information is haunting the `Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
.cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------

Jeff Epler

Jan 14, 2003, 3:07:07 PM

On Tue, Jan 14, 2003 at 12:32:00PM +0000, Michael Hudson wrote:
> A reflection of this fact is that spammers are clearly testing their
> mails against SpamAssassin, whereas some mailshots I get, and want to
> get, (the Apple Developer Connection News being a striking example)
> frequently get flagged by SA as spam.

They must not be working too hard. I have a message here with a very
spammy "copy your dee vee dee" body taken from real spam plus a few
headers and footers that gets -15 points on spamassassin 2.43 (default
weights, no whitelist or auto-whitelist). Or maybe 2.43 isn't their
target..

Jeff

Tim Peters

Jan 14, 2003, 9:47:18 PM

[Paul Rubin]
> ...

> Spambayes is already working better than spamassassin? Wow.

It depends on what you use it for. It was intended to be used by a single
person on their own email, and it quickly learns so much about a single
person's quirks that even very early versions of the spambayes code did at
least as well as a well-maintained SpamAssassin.

For group use your mileage will vary. General single-classifier tests on
all the email traffic going thru python.org tried to exclude personal email
accounts, leaving just the Python and Zope mailing list traffic, and a
number of small, private, special-interest mailing lists. We know how well
spambayes did in those tests, but aren't so sure about how well SA did; I do
know it caught a lot of spam that got beyond SA. OTOH, python.org rejects
many msgs before SA ever sees them, so we have no idea how either system
would do on those.

The tests turned up one class of msg where SA had a real advantage: very
brief administrivia requests to *-request addresses. The ones that caused
trouble were typically a single-word msg like "unsubscribe" (itself a word
with high spamprob!), followed by a forward of a spam or off-topic
"conference announcement" that had leaked thru on the mailing list, *and*/or
a dozen kilobytes of employer-generated HTML disclaimers ("whirlygigs.com
is a regulated investment company, and is not responsible for the etc etc
etc"). An appreciable fraction of a percent of administrivia msgs look like
that. SA did better on those because python.org's SA installation is tuned
to give a huge "ham boost" to any email sent to a *-request address.
spambayes has no gimmicks like that.

Some of the personal email that snuck thru was also troublesome. Everyone
signs up for *some* HTML newsletters that most other people would consider
to be spam. Train a single classifier to accept the financial newsletters I
want to see, and the classifier becomes weaker at weeding out "similar"
stuff for other people. Or if you happen to be resigned to the size of your
trouser snake and would rather not be reminded of it, training a shared
classifier to reject penis-enlargement spam stops Barry from getting the
help he so desperately needs.

> I guess I'll look into switching. It's seemed to me up til now that
> it really takes a mixture of dynamic (Bayesian)

There's really nothing Bayesian about the spambayes code, except for a
Bayesian adjustment to the estimates of individual words' spam
probabilities. The probability combining scheme isn't Bayesian at all. An
article by Gary Robinson about the math behind spambayes will be published
in Linux Journal soon, followed the next month with an article by Richie
Hindle about the more practical aspects of the system.
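
The "Bayesian adjustment" to a single word's probability can be sketched as a shrinkage toward a neutral prior when the word has been seen only a few times. This assumes the formula Gary Robinson has described, f = (s*x + n*p) / (s + n); the default values for `strength` and `unknown` below are assumptions and may not match what spambayes ships with:

```python
def adjusted_spamprob(spam_hits, ham_hits, n_spam, n_ham,
                      strength=0.45, unknown=0.5):
    """Shrink a word's raw spam probability toward `unknown` when data is thin.

    spam_hits / ham_hits: messages of each kind containing the word.
    n_spam / n_ham: total messages of each kind trained on (both > 0).
    """
    n = spam_hits + ham_hits
    if n == 0:
        return unknown  # never-seen word: fall back to the prior
    spam_ratio = spam_hits / n_spam
    ham_ratio = ham_hits / n_ham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    return (strength * unknown + n * raw) / (strength + n)
```

A word seen in one spam and no ham gets roughly 0.84 instead of 1.0, while a word seen in fifty spams and no ham gets above 0.99, so rare words cannot dominate the score.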

> and hand-coded (SA) filtering

For use by a group of unrelated individuals (say, an ISP, or corporate email
server), I expect that's true.

> I've heard the next version of SA will incorporate Bayesian filtering
> in addition to what it already does.

SpamAssassin's Matt Sergeant hung out on the spambayes mailing list for
quite a while, and picked up some number of the techniques for SA's use.
More power to 'em, although I no longer have a "spam problem" so stopped
paying attention <wink -- but I still get about 100 spam a day, and it all
ends up in my spam folder now>.

Anthony Baxter

Jan 14, 2003, 10:27:12 PM

>>> Tim Peters wrote

> [Paul Rubin]
> > ...
> > Spambayes is already working better than spamassassin? Wow.
>
> It depends on what you use it for. It was intended to be used by a single
> person on their own email, and it quickly learns so much about a single
> person's quirks that even very early versions of the spambayes code did at
> least as well as a well-maintained SpamAssassin.

I know that for _my_ use, it kicks spamassassin's butt.

I still have spamassassin running over the inbound email (but in non-mangle-
the-spam mode), so I can check occasionally how it's going.

For the last 100 spams in my spam folder, spambayes nailed 92 as spam,
and 8 as unsure. There were no false negatives (missed spam), although
I seem to see about 1 a week or so of those, and I've yet to get a
false positive. SA tells me that 86 of these 100 are spam, and 14 are not.
When I was using SA regularly, I had to put an enormous set of whitelist
addresses to let things through like Blackstar's regular mailouts and
other commercial email that I wanted to see, as well as for things like
RISKS digest (which, for some reason, SA _really_ hated). This is
obviously a hole that spammers will try to exploit - I already see
spams with sender addresses @amazon.com, presumably to try and slip
through this sort of hole.

Hopefully SA will pick up some of the techniques in a future version.

Anthony

Donn Cave

Jan 14, 2003, 11:35:57 PM

Quoth me...@gnosis.cx (David Mertz):
...

| I've noticed, however, that the manual filters are usually redundant.
| Almost everything that the patterns I hand coded catch is then also
| caught by the trigram-bayes style.

Hand coded rules would probably work well to accept mail that the
statistical analysis might otherwise reject - e.g., Tim's example
of the conference invitations, any mail from your employer, etc.
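
One way to layer hand-coded accept rules in front of a statistical score is sketched below. The rules, the employer domain, and the 0.9 cutoff are all hypothetical; headers are modeled as a plain dict for simplicity:

```python
import re

ACCEPT_RULES = [
    # Mail to a list's *-request address is administrivia, not spam.
    ("To", re.compile(r"-request@", re.IGNORECASE)),
    # Hypothetical employer domain: always accept.
    ("From", re.compile(r"@employer\.example$", re.IGNORECASE)),
]


def classify(headers, statistical_score, spam_cutoff=0.9):
    """Apply accept rules first, then fall back to the statistical score."""
    for field, pattern in ACCEPT_RULES:
        if pattern.search(headers.get(field, "")):
            return "ham"
    return "spam" if statistical_score >= spam_cutoff else "ham"
```

The rules only ever rescue mail; they never condemn it, so a mistake in a rule costs a false negative rather than a lost message.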

Donn Cave, do...@drizzle.com

Michael Hudson

Jan 15, 2003, 11:31:40 AM

Jeff Epler <jep...@unpythonic.net> writes:

> On Tue, Jan 14, 2003 at 12:32:00PM +0000, Michael Hudson wrote:
> > A reflection of this fact is that spammers are clearly testing their
> > mails against SpamAssassin, whereas some mailshots I get, and want to
> > get, (the Apple Developer Connection News being a striking example)
> > frequently get flagged by SA as spam.
>
> They must not be working too hard.

OK, "some spammers".

> I have a message here with a very spammy "copy your dee vee dee"
> body taken from real spam plus a few headers and footers that gets
> -15 points on spamassassin 2.43 (default weights, no whitelist or
> auto-whitelist). Or maybe 2.43 isn't their target..

A little confused -- isn't -15 verymuchnotspam in SA-land?

Cheers,
M.

--
I really hope there's a catastrophic bug in some future e-mail
program where if you try and send an attachment it cancels your
ISP account, deletes your harddrive, and pisses in your coffee
-- Adam Rixey