I have the following observations:
Let's suppose I haven't trained the filter. If I mark a plain text
message as not junk and I mark an HTML-formatted message as junk, all
HTML messages are considered junk afterwards (if I don't train it any
further). That leads me to the conclusion that the whole source of the
message body is used for training: words including the HTML formatting
tags. I don't think this is right. I think the message should be
converted to a plain-text representation no matter which format it is
in (even PDF, if it could be handled internally), and then that text
should be used for training. I think this way it would be more
effective and would need less training.
Another one: It would be nice if similar filtering could be done on
the values (the text) of the 'Received' headers only (I don't know
whether the current behavior uses the whole message source or just the
body parts). This way it could filter spam messages by their origin. I
think this would need even less training and would be similar to
existing anti-spam techniques such as matching against DNS blacklists,
with the only difference that it would not use an actual list but
would build one itself.
--
Stanimir <stanio(_at_)gbg.bg>
Actually, this defeats the purpose of Bayesian filtering. Bayesian
filtering is meant to be done on the messages in the form they are
received. That way the filter can catch any element of a message that
only (or most often) appears in spam. If the string "<marquee", which is
a fragment of an HTML tag, shows up only in the source of a spam, it
will be used as evidence that a message is spam, even though it is not
part of the readable text of the message (but its presence is still
apparent in the message; it makes text move back and forth). Similarly,
for most people, the presence of the string "<script" is evidence of
spam (since benign messages rarely contain JavaScript code). Stripping
the tags from the message before filtering would remove this useful data
from the filter's pool of evidence, and would make the Bayesian
filtering less effective, not more.
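To make the "tokens as evidence" point concrete, here is a minimal
sketch of how a naive Bayesian filter turns raw-source tokens such as
"<script" into spam probabilities. This is not Mozilla's actual
implementation; the formula loosely follows the one in Paul Graham's
"A Plan for Spam" essay, and the corpora below are invented for
illustration.

```python
def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Rough P(spam | token) from per-corpus occurrence counts."""
    g = 2 * ham_counts.get(token, 0)   # weight ham occurrences double
    b = spam_counts.get(token, 0)
    if g + b < 1:
        return 0.4                     # unseen tokens count as mildly innocent
    p = (b / n_spam) / ((g / n_ham) + (b / n_spam))
    return min(0.99, max(0.01, p))     # clamp to avoid absolute certainty

# Tiny hand-made corpora: "<script" appears only in spam sources,
# so the raw tag fragment itself becomes strong spam evidence.
spam_counts = {"<script": 40, "free": 55, "meeting": 1}
ham_counts  = {"free": 10, "meeting": 80}

p = token_spam_probability("<script", spam_counts, ham_counts,
                           n_spam=100, n_ham=100)
# p comes out at the 0.99 ceiling: the tag fragment is near-certain spam
# evidence, which is exactly the data tag-stripping would throw away.
```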
BTW, PDF is not handled internally, it is handled by a plugin (since the
format and technology is owned by Adobe Systems). Normalizing PDF to
text before filtering would be a whole 'nother can of worms, but since
PDF spam is exceedingly rare (if any exists at all), it really isn't a
big consideration.
> Another one: It would be nice if similar filtering could be done on the
> values (the text) of the 'Received' headers only (I don't know whether
> the current behavior uses the whole message source or just the body
> parts). This way it could filter spam messages by their origin. I think
> this would need even less training and would be similar to existing
> anti-spam techniques such as matching against DNS blacklists, with the
> only difference that it would not use an actual list but would build
> one itself.
Current behavior, at least as described in the original essay on
Bayesian filtering, is that the entire message source is used, so that
if any aspect of a message is representative of spam (including
suspicious headers), it will be picked up and used as evidence when
filtering.
--
Mozilla 1.0 Guide: http://www.mozilla.org/start/1.0/guide/
Mozilla 1.0 FAQ: http://www.mozilla.org/start/1.0/faq/
End-user discussion and peer support:
snews://secnews.netscape.com:563/netscape.mozilla.user.general
snews://secnews.netscape.com:563/netscape.mozilla.user.win32
snews://secnews.netscape.com:563/netscape.mozilla.user.mac
snews://secnews.netscape.com:563/netscape.mozilla.user.unix
At present I have a cache of 134 spam mails that Mozilla will not mark
as spam. For this reason I still feel the need for implementation of
white/black lists, since this shows me that the Bayesian method will
never be 100% effective, although I am at about 95% effectiveness now
after training on 3500 spam mails and about 600 non-spam mails.
--
Bill McCartney
Union City, Georgia
E-mail rot 13ed
ovyy...@oryyfbhgu.arg
There is little documentation.
First go to Tools/Junk Mail Controls/ and enable the controls for all
mail accounts you want to filter (this only works on Mail, not News,
but the training.dat file can be trained on spam in the newsgroups).
Second, if you want the spam moved to the Junk folder, enable that
option. Third, if you don't have a Junk column in the mail window
(mine is Subject/Sender/Date/trashcan icon for Junk), you can enable
it by clicking the little icon to the right of Date and turning Junk
Status on.
Now Junk controls do nothing until you start training them. When you
get spam you can train in three ways: Tools/Mark Selected Messages as
Junk, the Junk button on the toolbar, or toggling the Junk status in
the Junk column next to Date. The more you train, the better it gets.
If you get a mail marked as Junk that is not spam, then use Tools/Mark
Selected Messages as Not Junk.
O.k. I've thought about it some more and realized that filtering all
elements and not just text is important, too. But I think there should
be a mechanism for triggering different kinds of approaches, different
targets on which the Bayesian filter acts.
> BTW, PDF is not handled internally, it is handled by a plugin (since the
> format and technology is owned by Adobe Systems). Normalizing PDF to
> text before filtering would be a whole 'nother can of worms, but since
> PDF spam is exceedingly rare (if any exists at all), it really isn't a
> big consideration.
>
I know PDF is handled by a plug-in, that it is a proprietary format,
etc. I just gave it as an example: "if it could be handled
internally". Am I being clear enough?
>> Another one: It would be nice if similar filtering could be done on
>> the values (the text) of the 'Received' headers only (I don't know
>> whether the current behavior uses the whole message source or just
>> the body parts). This way it could filter spam messages by their
>> origin. I think this would need even less training and would be
>> similar to existing anti-spam techniques such as matching against DNS
>> blacklists, with the only difference that it would not use an actual
>> list but would build one itself.
>>
>
> Current behavior, at least as described in the original essay on
> Bayesian filtering, is that the entire message source is used, so that
> if any aspect of a message is representative of spam (including
> suspicious headers), it will be picked up and used as evidence when
> filtering.
>
The main idea behind filtering just the headers is that you don't need
to download the whole message to determine whether it is spam or not
(this is the case with IMAP accounts).
I suppose two "training" files could be kept: one produced only from
the headers and one from the body parts (or the whole message). When a
mail is determined to be spam from the headers alone, it stays that
way until you mark it as not junk. If a message is determined to be
not junk from the headers alone, it could be filtered further on the
body contents.
I'm just giving a few examples. Sure, it needs more thinking and
design. But for now I think using the Bayesian algorithm to filter
messages only by their origin (not the whole headers, just the
'Received' ones) would be the most effective and least expensive
approach (time for training, size of the training database).
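As a toy illustration of the two-database idea (all names, models, and
thresholds here are hypothetical, not an existing Mozilla API): score
the headers first, and only fetch and score the body when the headers
alone look clean.

```python
SPAM_THRESHOLD = 0.9   # arbitrary cutoff, assumed for illustration

def classify(msg, header_model, body_model, fetch_body):
    # Stage 1: headers only -- cheap, no full download needed over IMAP.
    p_header = header_model(msg["headers"])
    if p_header >= SPAM_THRESHOLD:
        return "junk"                  # decided without fetching the body
    # Stage 2: headers looked clean, so fetch the body and score it too.
    p_body = body_model(fetch_body(msg))
    return "junk" if p_body >= SPAM_THRESHOLD else "not junk"

# Toy stand-ins for the two separately trained databases.
header_model = lambda h: 0.95 if "bad-relay.example" in h.get("Received", "") else 0.1
body_model   = lambda b: 0.95 if "FREE" in b else 0.05

msg = {"headers": {"Received": "from bad-relay.example"}}
verdict = classify(msg, header_model, body_model, fetch_body=lambda m: "")
# verdict is "junk", and the (empty) body fetcher was never consulted.
```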
--
Stanimir <stanio(_at_)gbg.bg>
Thanks
There seems to be a lot of debate among the statistical filtering
projects on how to 'tokenize'. If you exclude HTML, you may lose some
important tags (<font color=red> seems to be high on many folks' spam
word lists). If you don't exclude it, you leave yourself open to a
bunch of spam tricks (I'm calling them spam sushi) -- things like
inserting random gibberish tags: h<gibberish>ow ar<325klj>e you<alkfh>
etc. Also, the community is pretty well split on questions like
'should you include capitalization and punctuation: is "free"
different from "FREE!!!"?' There are definitely some benefits to
stripping, but there are some serious downsides too.
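The trade-off can be seen with two toy tokenizers (a sketch, not any
project's real tokenizer): the raw one keeps tag evidence like "<font"
but lets inserted gibberish tags split words apart, while the
stripping one rejoins the words but loses the tag tokens.

```python
import re

def tokenize_raw(text):
    # Split on whitespace only, keeping tag fragments as tokens.
    return re.findall(r"\S+", text)

def tokenize_stripped(text):
    # Remove anything that looks like an HTML tag, then split into words.
    return re.findall(r"[A-Za-z!]+", re.sub(r"<[^>]*>", "", text))

sushi = 'h<x>ow ar<y>e you<z> <font color=red>FREE!!!</font>'
raw = tokenize_raw(sushi)            # keeps '<font' but never sees 'how'
stripped = tokenize_stripped(sushi)  # sees 'how' but never sees '<font'
```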
-=-
Alan
Update to a newer version to get the move function. When downloading,
highlight the ACCOUNT name, not Inbox or any other folder; then when
you open Inbox, Junk, or whatever, the mails will be classified and
moved. The only way to classify when downloading while on the Inbox is
Tools/Run Junk Mail Controls on Selected Messages, but then the move
function doesn't work after the fact.
> There is little documentation.
> First go to Tools/Junk Mail Controls/ and enable the controls for all
> mail accounts you want to filter (this only works on Mail, not News,
> but the training.dat file can be trained on spam in the newsgroups).
> Second, if you want the spam moved to the Junk folder, enable that
> option. Third, if you don't have a Junk column in the mail window
> (mine is Subject/Sender/Date/trashcan icon for Junk), you can enable
> it by clicking the little icon to the right of Date and turning Junk
> Status on.
> Now Junk controls do nothing until you start training them. When you
> get spam you can train in three ways: Tools/Mark Selected Messages as
> Junk, the Junk button on the toolbar, or toggling the Junk status in
> the Junk column next to Date. The more you train, the better it gets.
> If you get a mail marked as Junk that is not spam, then use Tools/Mark
> Selected Messages as Not Junk.
For what it's worth, the people behind this particular sub-project of
Mozilla have my eternal undying gratitude: late last year I was
unfortunate enough to have my address put down as the reply-to for
several very widely-distributed spam messages, and I probably would have
lost my business if it weren't for the Bayesian filter.
I have 1.3b working (mostly, see below) at home but here at work I can't
get the Junk Mail Controls dialog to do anything at all. I bring it up,
check Enable Junk Mail Controls, click OK, and nothing happens ... the
dialog just sits there until I hit Cancel, and when I bring it back up,
Enable Junk Mail Controls has not been checked.
At home, 1.3b works fine, including Enable Junk Mail Controls. I am
still training my In box, and am puzzled by the fact that it won't sort
by the Junk icon. I still have to manually click the Read flag on and
off if I want to sort.
Any ideas on how I could get by either of these problems would be
greatly appreciated. Does anybody know where the junk mail preferences
are actually stored?
Thanks very much,
Kent Brewster
http://www.speculations.com
You seem to be using the 1.3b official release. I have no problem with
the 20030214 nightly.
I had the same problem when downloading a daily build and letting it
load over an existing version.
I got rid of the problem by erasing the "XUL.MFL" file, the last file
in your profile. Actually, I didn't erase it at first, just renamed it
until I saw what would happen; then I erased it.
Surprise, surprise, the problem went away. The Junk Mail function now
moves all the junk to the Trash folder (I have it set to Trash).
Give it a try,
Max
> Garth Wallace wrote:
>
>>> [...]
>>
>> Current behavior, at least as described in the original essay on
>> Bayesian filtering, is that the entire message source is used, so that
>> if any aspect of a message is representative of spam (including
>> suspicious headers), it will be picked up and used as evidence when
>> filtering.
>>
>
> The main idea behind filtering just the headers is that you don't need
> to download the whole message to determine whether it is spam or not
> (this is the case with IMAP accounts).
>
> I suppose two "training" files could be kept: one produced only from
> the headers and one from the body parts (or the whole message). When a
> mail is determined to be spam from the headers alone, it stays that way
> until you mark it as not junk. If a message is determined to be not
> junk from the headers alone, it could be filtered further on the body
> contents.
>
> I'm just giving a few examples. Sure, it needs more thinking and
> design. But for now I think using the Bayesian algorithm to filter
> messages only by their origin (not the whole headers, just the
> 'Received' ones) would be the most effective and least expensive
> approach (time for training, size of the training database).
>
O.k. I've thought for a while and came up with an idea that could be
used for general email classification, not just junk filtering.
There could be a place in the Mail & Newsgroups preferences to define
different categories for classification, with "Junk" predefined, of
course. My idea is that every category could be customized to use a
different source for its training -- not just the whole message.
That's what I meant by "mechanism for triggering different targets on
which the Bayesian filter acts".
For example, I could define a category that uses only the contents of
the 'Received' and 'From' header fields for training. Later, only
those contents (of the 'Received' and 'From' headers) would be
compared with the collected data when matching for this category.
I have some finishing thoughts (like changes to the UI), but the above
is the general idea, which I think is simple enough and gives far more
freedom for creativity on the users' side.
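A rough sketch of how such configurable categories might look (every
name here is hypothetical; this is not a proposed Mozilla API): each
category declares which part of the message feeds both its training
and its matching, with "Junk" predefined to use the whole source.

```python
def whole_source(msg):
    # Entire raw message: headers plus body.
    return msg["headers_raw"] + "\n" + msg["body"]

def received_and_from(msg):
    # Only the origin-related header lines.
    lines = msg["headers_raw"].splitlines()
    return "\n".join(l for l in lines
                     if l.startswith(("Received:", "From:")))

# Category name -> extractor used for both training and matching.
categories = {
    "Junk":      whole_source,       # predefined: entire message source
    "By origin": received_and_from,  # user-defined: origin headers only
}

msg = {
    "headers_raw": "From: a@example.org\nReceived: from relay.example\nSubject: hi",
    "body": "hello",
}
origin_text = categories["By origin"](msg)
# origin_text contains only the From and Received lines; the Subject and
# body never reach this category's training database.
```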
--
Stanimir <stanio(_at_)gbg.bg>