
Spam collection


Mikkel Rasmussen

May 1, 2001, 5:27:31 AM
Disclaimer: This is not just for Python programmers (since I use Python I
thought it would be nice to co-operate with other Python programmers).

I have thought about sharing my spam collection with others for use in
developing a better spam filter. We need a large collection of spam to be
able to do various forms of analysis on it. I don't know if such a
collection already exists. If so, I would like to add mine.

My spam filter "idea" is to use keywords, because I use Outlook and Outlook
does not give any other possibilities (as far as I know). The problem is in
choosing the best keywords without using *any* word that occurs in a
non-spam message.
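
The keyword selection described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names `tokenize` and `pick_keywords` are hypothetical, messages are assumed to be plain strings, and the tokenizer is deliberately crude. It keeps only words that never occur in any non-spam message and ranks them by how many spam messages contain them.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and return its set of distinct words (crude tokenizer)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def pick_keywords(spam_messages, ham_messages, top_n=10):
    """Return (word, spam_message_count) pairs for words that occur in spam
    but in no non-spam ("ham") message, most frequent first."""
    # Every word seen in any non-spam message is disqualified.
    ham_vocab = set()
    for msg in ham_messages:
        ham_vocab |= tokenize(msg)

    # Count, per candidate word, how many spam messages contain it.
    counts = Counter()
    for msg in spam_messages:
        counts.update(tokenize(msg) - ham_vocab)
    return counts.most_common(top_n)


if __name__ == "__main__":
    spam = ["Make money fast", "Free money now"]
    ham = ["Meeting now at noon", "fast reply please"]
    print(pick_keywords(spam, ham))
```

The strict "no word from non-spam" rule means the keyword list shrinks as the non-spam collection grows, which is one reason a large shared collection would help.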

We probably also need a definition of spam. A tentative definition could be
"irrelevant messages", where "irrelevant" is necessarily subjective. My
spam might not be your spam :-)

Any further ideas?

Mikkel Rasmussen


Aidan Finn

May 1, 2001, 6:24:05 AM
In article <pkvH6.89$Qj7....@news.get2net.dk>, "Mikkel Rasmussen"
<foo...@get2net.dk> wrote:


> My spam filter "idea" is to use keywords, because I use Outlook and
> Outlook does not give any other possibilities (as far as I know). The
> problem is in choosing the best keywords ...

You might try the Rainbow text classifier
(http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html)
to discover the most informative words for junk e-mail.
There are some papers on using Bayesian classification to do this kind of
filtering. The paper "A Bayesian Approach to Filtering Junk E-Mail" and
Kushmerick's AdEater system spring to mind. If you're interested, these can
probably be found on CiteSeer (http://citeseer.nj.nec.com/cs).
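
To give a rough feel for what "most informative words" means, here is a small Python sketch. It is not how Rainbow or the paper mentioned above work; it swaps in a simple smoothed log-odds score in the naive Bayes spirit. The function names `doc_frequencies` and `informative_words` are made up for illustration, and messages are assumed to be plain strings.

```python
import math
import re
from collections import Counter


def doc_frequencies(messages):
    """Count, for each word, how many messages contain it at least once."""
    counts = Counter()
    for msg in messages:
        counts.update(set(re.findall(r"[a-z0-9']+", msg.lower())))
    return counts


def informative_words(spam_messages, ham_messages, top_n=10):
    """Rank words by log( P(word | spam) / P(word | non-spam) ).

    Probabilities are add-one (Laplace) smoothed so that a word missing
    from one class does not cause a division by zero.
    """
    spam_df = doc_frequencies(spam_messages)
    ham_df = doc_frequencies(ham_messages)
    n_spam, n_ham = len(spam_messages), len(ham_messages)

    scores = {}
    for word in set(spam_df) | set(ham_df):
        p_spam = (spam_df[word] + 1) / (n_spam + 2)
        p_ham = (ham_df[word] + 1) / (n_ham + 2)
        scores[word] = math.log(p_spam / p_ham)

    # Highest scores are the words most indicative of spam.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    spam = ["free money", "free offer", "free money now"]
    ham = ["meeting now", "lunch now", "project notes"]
    for word, score in informative_words(spam, ham, top_n=3):
        print(f"{word}: {score:.3f}")
```

Unlike a strict keyword list, a score like this tolerates words that occasionally show up in non-spam mail, so the ranked words could still be tried as Outlook keywords with some care.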


Let me know if this is useful.

AF
