I have the following observations:
Let's suppose I haven't trained the filter. If I mark a plain text
message as not junk and I mark an HTML-formatted message as junk, all
HTML messages are considered junk afterwards (if I don't train it any
further). That leads me to the conclusion that the whole source of the
message body is used for training: words including the HTML formatting
tags. I don't think this is right. I think the message should be
converted to a plain-text representation no matter which format it is
in (even PDF, if it could be handled internally), and then that text
should be used for training. I think this way it would be more
effective and would need less training.
Another one: It would be nice if similar filtering could be done on
the values (the text) of the 'Received' headers only (I don't know
whether the current behavior uses the whole message source or just the
body parts). This way it could filter spam messages by their origin. I
think this would need even less training and would be similar to
existing anti-spam techniques such as matching against DNS blacklists,
with the only difference that it would not use an actual list but
would build one itself.
--
Stanimir <stanio(_at_)gbg.bg>
Actually, this defeats the purpose of Bayesian filtering. Bayesian
filtering is meant to be done on the messages in the form they are
received. That way the filter can catch any element of a message that
only (or most often) appears in spam. If the string "<marquee", which is
a fragment of an HTML tag, shows up only in the source of a spam, it
will be used as evidence that a message is spam, even though it is not
part of the readable text of the message (but its presence is still
apparent in the message; it makes text move back and forth). Similarly,
for most people, the presence of the string "<script" is evidence of
spam (since benign messages rarely contain JavaScript code). Stripping
the tags from the message before filtering would remove this useful data
from the filter's pool of evidence, and would make the Bayesian
filtering less effective, not more.
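To make the "tokens as evidence" point concrete, here is a minimal
sketch of how a naive Bayesian filter turns raw-source tokens such as
"<script" into spam probabilities. This is not Mozilla's actual
implementation; the formula loosely follows the one in Paul Graham's
"A Plan for Spam" essay, and the corpora below are invented for
illustration.

```python
def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Rough P(spam | token) from per-corpus occurrence counts."""
    g = 2 * ham_counts.get(token, 0)   # weight ham occurrences double
    b = spam_counts.get(token, 0)
    if g + b < 1:
        return 0.4                     # unseen tokens count as mildly innocent
    p = (b / n_spam) / ((g / n_ham) + (b / n_spam))
    return min(0.99, max(0.01, p))     # clamp to avoid absolute certainty

# Tiny hand-made corpora: "<script" appears only in spam sources,
# so the raw tag fragment itself becomes strong spam evidence.
spam_counts = {"<script": 40, "free": 55, "meeting": 1}
ham_counts  = {"free": 10, "meeting": 80}

p = token_spam_probability("<script", spam_counts, ham_counts,
                           n_spam=100, n_ham=100)
# p comes out at the 0.99 ceiling: the tag fragment is near-certain spam
# evidence, which is exactly the data tag-stripping would throw away.
```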
BTW, PDF is not handled internally, it is handled by a plugin (since the
format and technology is owned by Adobe Systems). Normalizing PDF to
text before filtering would be a whole 'nother can of worms, but since
PDF spam is exceedingly rare (if any exists at all), it really isn't a
big consideration.
> Another one: It would be nice if similar filtering could be done on the
> values (the text) of the 'Received' headers only (I don't know whether
> the current behavior uses the whole message source or just the body
> parts). This way it could filter spam messages by their origin. I think
> this would need even less training and would be similar to existing
> anti-spam techniques such as matching against DNS blacklists, with the
> only difference that it would not use an actual list but would build
> one itself.
Current behavior, at least as described in the original essay on
Bayesian filtering, is that the entire message source is used, so that
if any aspect of a message is representative of spam (including
suspicious headers), it will be picked up and used as evidence when
filtering.
--
Mozilla 1.0 Guide: http://www.mozilla.org/start/1.0/guide/
Mozilla 1.0 FAQ: http://www.mozilla.org/start/1.0/faq/
End-user discussion and peer support:
snews://secnews.netscape.com:563/netscape.mozilla.user.general
snews://secnews.netscape.com:563/netscape.mozilla.user.win32
snews://secnews.netscape.com:563/netscape.mozilla.user.mac
snews://secnews.netscape.com:563/netscape.mozilla.user.unix
At present I have a cache of 134 spam mails that Mozilla will not mark
as spam. For this reason I still feel the need for implementation of
white/black lists, since this shows me that the Bayesian method will
never be 100% effective, although I am at about 95% effectiveness now
after training on 3500 spam mails and about 600 non-spam mails.
--
Bill McCartney
Union City, Georgia
E-mail rot 13ed
ovyy...@oryyfbhgu.arg
There is little documentation.
First go to Tools/Junk Mail Controls/ and enable the controls for all
mail accounts you want to filter (this only works on Mail, not News,
but the training.dat file can be trained on spam in the newsgroups).
Second, if you want the spam moved to the Junk folder, enable that
option. Third, if you don't have a Junk column in the mail window
(mine is Subject/Sender/Date/trashcan icon for Junk), you can enable
it by clicking the little icon to the right of Date and turning Junk
Status on.
Now Junk controls do nothing until you start training them. When you
get spam you can train in three ways: Tools/Mark Selected Messages as
Junk, the Junk button on the toolbar, or toggling the Junk status in
the Junk column next to Date. The more you train, the better it gets.
If you get a mail marked as Junk that is not spam, then use Tools/Mark
Selected Messages as Not Junk.
O.k. I've thought about it some more and realized that filtering all
elements and not just text is important, too. But I think there should
be a mechanism for triggering different kinds of approaches, different
targets on which the Bayesian filter acts.
> BTW, PDF is not handled internally, it is handled by a plugin (since the
> format and technology is owned by Adobe Systems). Normalizing PDF to
> text before filtering would be a whole 'nother can of worms, but since
> PDF spam is exceedingly rare (if any exists at all), it really isn't a
> big consideration.
>
I know PDF is handled by a plug-in, that it is a proprietary format,
etc. I just gave it as an example: "if it could be handled
internally". Am I being clear enough?
>> Another one: It would be nice if similar filtering could be done on
>> the values (the text) of the 'Received' headers only (I don't know
>> whether the current behavior uses the whole message source or just
>> the body parts). This way it could filter spam messages by their
>> origin. I think this would need even less training and would be
>> similar to existing anti-spam techniques such as matching against DNS
>> blacklists, with the only difference that it would not use an actual
>> list but would build one itself.
>>
>
> Current behavior, at least as described in the original essay on
> Bayesian filtering, is that the entire message source is used, so that
> if any aspect of a message is representative of spam (including
> suspicious headers), it will be picked up and used as evidence when
> filtering.
>
The main idea behind filtering just the headers is that you don't need
to download the whole message to determine whether it is spam or not
(this is the case with IMAP accounts).
I suppose two "training" files could be kept: one produced only from
the headers and one from the body parts (or the whole message). When a
mail is determined to be spam from the headers alone, it stays that
way until you mark it as not junk. If a message is determined to be
not junk from the headers alone, it could be filtered further on the
body contents.
I'm just giving a few examples. Sure, it needs more thinking and
design. But for now I think using the Bayesian algorithm to filter
messages only by their origin (not the whole headers, just the
'Received' ones) would be the most effective and least expensive
approach (time for training, size of the training database).
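As a toy illustration of the two-database idea (all names, models, and
thresholds here are hypothetical, not an existing Mozilla API): score
the headers first, and only fetch and score the body when the headers
alone look clean.

```python
SPAM_THRESHOLD = 0.9   # arbitrary cutoff, assumed for illustration

def classify(msg, header_model, body_model, fetch_body):
    # Stage 1: headers only -- cheap, no full download needed over IMAP.
    p_header = header_model(msg["headers"])
    if p_header >= SPAM_THRESHOLD:
        return "junk"                  # decided without fetching the body
    # Stage 2: headers looked clean, so fetch the body and score it too.
    p_body = body_model(fetch_body(msg))
    return "junk" if p_body >= SPAM_THRESHOLD else "not junk"

# Toy stand-ins for the two separately trained databases.
header_model = lambda h: 0.95 if "bad-relay.example" in h.get("Received", "") else 0.1
body_model   = lambda b: 0.95 if "FREE" in b else 0.05

msg = {"headers": {"Received": "from bad-relay.example"}}
verdict = classify(msg, header_model, body_model, fetch_body=lambda m: "")
# verdict is "junk", and the (empty) body fetcher was never consulted.
```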
--
Stanimir <stanio(_at_)gbg.bg>
Thanks
There seems to be a lot of debate among the statistical filtering
projects on how to 'tokenize'. If you exclude HTML, you may lose some
important tags (<font color=red> seems to be high on many folks' spam
word lists). If you don't exclude it, you leave yourself open to a
bunch of spam tricks (I'm calling them spam sushi) -- things like
inserting random gibberish tags: h<gibberish>ow ar<325klj>e you<alkfh>
etc. Also, the community is pretty well split on questions like
'should you include capitalization and punctuation: is "free"
different from "FREE!!!"?' There are definitely some benefits to
stripping, but there are some serious downsides too.
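The trade-off can be seen with two toy tokenizers (a sketch, not any
project's real tokenizer): the raw one keeps tag evidence like "<font"
but lets inserted gibberish tags split words apart, while the
stripping one rejoins the words but loses the tag tokens.

```python
import re

def tokenize_raw(text):
    # Split on whitespace only, keeping tag fragments as tokens.
    return re.findall(r"\S+", text)

def tokenize_stripped(text):
    # Remove anything that looks like an HTML tag, then split into words.
    return re.findall(r"[A-Za-z!]+", re.sub(r"<[^>]*>", "", text))

sushi = 'h<x>ow ar<y>e you<z> <font color=red>FREE!!!</font>'
raw = tokenize_raw(sushi)            # keeps '<font' but never sees 'how'
stripped = tokenize_stripped(sushi)  # sees 'how' but never sees '<font'
```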
-=-
Alan
Update to a newer version to get the move function. When downloading,
highlight the ACCOUNT name, not Inbox or any other folder; then when
you open Inbox, Junk, or whatever, the mails will be classified and
moved. The only way to classify when downloading while on the Inbox is
Tools/Run Junk Mail Controls on Selected Messages, but then the move
function doesn't work after the fact.
> There is little documentation.
> First go to Tools/Junk Mail Controls/ and enable the controls for all
> mail accounts you want to filter (this only works on Mail, not News,
> but the training.dat file can be trained on spam in the newsgroups).
> Second, if you want the spam moved to the Junk folder, enable that
> option. Third, if you don't have a Junk column in the mail window
> (mine is Subject/Sender/Date/trashcan icon for Junk), you can enable
> it by clicking the little icon to the right of Date and turning Junk
> Status on.
> Now Junk controls do nothing until you start training them. When you
> get spam you can train in three ways: Tools/Mark Selected Messages as
> Junk, the Junk button on the toolbar, or toggling the Junk status in
> the Junk column next to Date. The more you train, the better it gets.
> If you get a mail marked as Junk that is not spam, then use Tools/Mark
> Selected Messages as Not Junk.
For what it's worth, the people behind this particular sub-project of
Mozilla have my eternal undying gratitude: late last year I was
unfortunate enough to have my address put down as the reply-to for
several very widely-distributed spam messages, and I probably would have
lost my business if it weren't for the Bayesian filter.
I have 1.3b working (mostly, see below) at home but here at work I can't
get the Junk Mail Controls dialog to do anything at all. I bring it up,
check Enable Junk Mail Controls, click OK, and nothing happens ... the
dialog just sits there until I hit Cancel, and when I bring it back up,
Enable Junk Mail Controls has not been checked.
At home, 1.3b works fine, including Enable Junk Mail Controls. I am
still training my In box, and am puzzled by the fact that it won't sort
by the Junk icon. I still have to manually click the Read flag on and
off if I want to sort.
Any ideas on how I could get by either of these problems would be
greatly appreciated. Does anybody know where the junk mail preferences
are actually stored?
Thanks very much,
Kent Brewster
http://www.speculations.com
You seem to be using the 1.3b official release. I have no problem with
the 20030214 nightly.
I had the same problem when downloading a daily build and letting it
load over an existing version.
I got rid of the problem by erasing the "XUL.MFL" file, the last file
in your profile. Actually, I didn't erase it at first, just renamed it
until I saw what would happen; then I erased it.
Surprise, surprise, the problem went away. The Junk Mail function now
moves all the junk to the Trash folder (I have it set to Trash).
Give it a try,
Max
> Garth Wallace wrote:
>
>>> [...]
>>
>> Current behavior, at least as described in the original essay on
>> Bayesian filtering, is that the entire message source is used, so that
>> if any aspect of a message is representative of spam (including
>> suspicious headers), it will be picked up and used as evidence when
>> filtering.
>>
>
> The main idea behind filtering just the headers is that you don't need
> to download the whole message to determine whether it is spam or not
> (this is the case with IMAP accounts).
>
> I suppose two "training" files could be kept: one produced only from
> the headers and one from the body parts (or the whole message). When a
> mail is determined to be spam from the headers alone, it stays that way
> until you mark it as not junk. If a message is determined to be not
> junk from the headers alone, it could be filtered further on the body
> contents.
>
> I'm just giving a few examples. Sure, it needs more thinking and
> design. But for now I think using the Bayesian algorithm to filter
> messages only by their origin (not the whole headers, just the
> 'Received' ones) would be the most effective and least expensive
> approach (time for training, size of the training database).
>
O.k. I've thought for a while and came up with an idea that could be
used for general email classification, not just junk filtering.
There could be a place in the Mail & Newsgroups preferences to define
different categories for classification, with "Junk" predefined, of
course. My idea is that every category could be customized to use a
different source for its training -- not just the whole message.
That's what I meant by "mechanism for triggering different targets on
which the Bayesian filter acts".
For example, I could define a category that uses only the contents of
the 'Received' and 'From' header fields for training. Later, only
those contents (of the 'Received' and 'From' headers) would be
compared with the collected data when matching for this category.
I have some finishing thoughts (like changes to the UI), but the above
is the general idea, which I think is simple enough and gives far more
freedom for creativity on the users' side.
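A rough sketch of how such configurable categories might look (every
name here is hypothetical; this is not a proposed Mozilla API): each
category declares which part of the message feeds both its training
and its matching, with "Junk" predefined to use the whole source.

```python
def whole_source(msg):
    # Entire raw message: headers plus body.
    return msg["headers_raw"] + "\n" + msg["body"]

def received_and_from(msg):
    # Only the origin-related header lines.
    lines = msg["headers_raw"].splitlines()
    return "\n".join(l for l in lines
                     if l.startswith(("Received:", "From:")))

# Category name -> extractor used for both training and matching.
categories = {
    "Junk":      whole_source,       # predefined: entire message source
    "By origin": received_and_from,  # user-defined: origin headers only
}

msg = {
    "headers_raw": "From: a@example.org\nReceived: from relay.example\nSubject: hi",
    "body": "hello",
}
origin_text = categories["By origin"](msg)
# origin_text contains only the From and Received lines; the Subject and
# body never reach this category's training database.
```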
--
Stanimir <stanio(_at_)gbg.bg>