/forummsg/99 got me thinking: how can we fingerprint spam effectively?
Just md5summing the entirety of it won't do, since many spams use a
"randomness" factor. So here is a way I think we could accurately
fingerprint spam.
First, we take the subject, remove all "Re:", "Fwd:", etc. tags and MD5
what's left. This becomes the first print.
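
A rough sketch of the subject print, just to make the idea concrete. The
prefix list (Re, Fw, Fwd) and the lowercasing are my assumptions, and
subject_print is a made-up name, not existing code:

```python
import hashlib
import re

# Assumed prefix list: the post only names "Re:" and "Fwd:" plus "etc".
_PREFIXES = re.compile(r'^\s*(?:(?:re|fwd?)\s*:\s*)+', re.IGNORECASE)

def subject_print(subject: str) -> str:
    """Return the MD5 hex digest of the subject without reply/forward tags."""
    cleaned = _PREFIXES.sub('', subject).strip()
    # Lowercasing is an extra assumption so "HELLO" and "hello" match.
    return hashlib.md5(cleaned.lower().encode('utf-8')).hexdigest()
```

With this, "Re: Fwd: Cheap meds" and "cheap meds" end up with the same
print.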
Then, we take the spam sample, remove all formatting (tags, whitespace,
everything except a stream of numbers and text) and we md5sum the first
250 chars. This becomes FingerPrint2.
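
Something like the following could produce FingerPrint2. It is only a
sketch: it assumes "formatting" means HTML-style tags plus anything that
is not an ASCII letter or digit, and it lowercases the stream, which the
post does not ask for. body_print is a hypothetical name:

```python
import hashlib
import re

def body_print(body: str, length: int = 250) -> str:
    """MD5 of the first `length` letters/digits of the body."""
    no_tags = re.sub(r'<[^>]*>', '', body)          # drop HTML-style tags
    stream = re.sub(r'[^A-Za-z0-9]', '', no_tags)   # keep letters and digits only
    # Lowercasing is an assumption, to ignore case-only randomization.
    return hashlib.md5(stream[:length].lower().encode('ascii')).hexdigest()
```

Anything after the first 250 characters of the cleaned stream is ignored,
so random junk appended at the end of a message would not change the
print.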
Next, we take all URLs in the email, remove anything that is obviously
wrong like multiple /////'s after a url and ?parameters so we end up
with a clean list of URLs, and we then remove all duplicates. We then
md5sum each URL left and make those FingerPrint3,4,etc.
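
Here is one possible way to do the URL cleanup, as a sketch. The regex
for finding URLs is deliberately crude and only looks for http/https
links; clean_url and url_prints are made-up names:

```python
import hashlib
import re
from urllib.parse import urlsplit

# Crude URL matcher (an assumption): http/https up to whitespace or a quote.
URL_RE = re.compile(r'https?://[^\s<>"\']+', re.IGNORECASE)

def clean_url(url: str) -> str:
    """Drop ?parameters and fragments, collapse repeated slashes in the path."""
    parts = urlsplit(url)
    path = re.sub(r'/{2,}', '/', parts.path).rstrip('/')
    return f'{parts.scheme.lower()}://{parts.netloc.lower()}{path}'

def url_prints(text: str) -> list:
    """One MD5 print per distinct cleaned URL found in the text."""
    cleaned = sorted({clean_url(u) for u in URL_RE.findall(text)})
    return [hashlib.md5(u.encode('utf-8')).hexdigest() for u in cleaned]
```

Two links that differ only in their ?parameters or in extra slashes
collapse into one print.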
Next, we take the "from" email address domain (e.g. hotmail.com), md5 it,
and make it a print.
Finally, we take the server routing information, and make each server
name the mail passed through a print.
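
A sketch of the last two prints using Python's standard email module. It
assumes the routing information means the Received: headers and that
they follow the usual "from <host> by <host>" layout, which spammers can
forge. All function names here are hypothetical:

```python
import hashlib
import re
from email import message_from_string
from email.utils import parseaddr

def md5_hex(text: str) -> str:
    return hashlib.md5(text.lower().encode('utf-8')).hexdigest()

def sender_domain_print(msg):
    """MD5 of the domain part of the From address, or None if there is none."""
    _, addr = parseaddr(msg.get('From', ''))
    if '@' not in addr:
        return None
    return md5_hex(addr.rpartition('@')[2])

def server_prints(msg):
    """One MD5 print per distinct host named in the Received headers."""
    names = set()
    for header in msg.get_all('Received', []):
        # Assumes the common "from <host> ... by <host> ..." layout.
        names.update(re.findall(r'\b(?:from|by)\s+([A-Za-z0-9.-]+)', header))
    return sorted(md5_hex(name) for name in names)
```

A message would be parsed first with message_from_string(raw_text) and
then passed to both functions.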
Now, we end up with a list of "prints" for the spam message. We can
print every message this way and then check for matches of 65% and
greater (not sure how strict we want to be). So if two messages each
have 7 prints, and 6 of those match (about 86%), we match those as a
single message chain.
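
The matching step could look roughly like this. The post does not say
what to divide by when two messages have a different number of prints,
so dividing by the larger count is my assumption, as are the function
names:

```python
def match_ratio(prints_a, prints_b) -> float:
    """Share of prints two messages have in common (0.0 to 1.0)."""
    a, b = set(prints_a), set(prints_b)
    if not a or not b:
        return 0.0
    # Assumption: divide by the larger set so extra prints lower the score.
    return len(a & b) / max(len(a), len(b))

def same_chain(prints_a, prints_b, threshold: float = 0.65) -> bool:
    """True if the two messages match at the threshold or better."""
    return match_ratio(prints_a, prints_b) >= threshold
```

For the example above, 6 shared prints out of 7 gives 6/7, which clears
the 65% threshold.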
We could also give some prints more importance than others. So, subject
has 1, body has 2, URLs have 2, from addresses 3, and servers 4, so if
two messages have the same subject, they'll be paired with a higher
percentage.
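
A sketch of the weighted version. It assumes the numbers are weights
where a bigger number counts for more (the post does not say which way
they run), and that each print is stored as a (kind, digest) pair. The
names are made up:

```python
# Weights taken from the post; reading them as "bigger counts more" is an assumption.
WEIGHTS = {'subject': 1, 'body': 2, 'url': 2, 'from': 3, 'server': 4}

def weight_of(prints) -> int:
    """Total weight of a collection of (kind, digest) prints."""
    return sum(WEIGHTS[kind] for kind, _ in prints)

def weighted_ratio(prints_a, prints_b) -> float:
    """Weight of the shared prints over the heavier message's total weight."""
    a, b = set(prints_a), set(prints_b)
    total = max(weight_of(a), weight_of(b))
    return weight_of(a & b) / total if total else 0.0
```

Under this reading, two messages that differ only in subject still score
11/12, while two that differ only in server score 8/12.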
Robin Monks
RoBiN {At} GmKiNg [dOt] OrG
This is just my 2 cents, obviously we need more input and suggestions.
Cheers!
Jug.
Replied at http://www.okopipi.org/forummsg/100#comment-326
Also, since Google released Tesseract as open source[1], we could
certainly decode anything stored as an image. Easier than captchas,
because the spammer _wants_ us to be able to easily read their
message.
[1] http://sourceforge.net/projects/tesseract-ocr
--
Open Source, Open Mind
It works with webmail, Thunderbird, or any Windows program that uses
POP3 and SMTP. I don't see any mention of IMAP on their page, other
than here:
http://www.google.com/search?q=site%3Amessagelevel.com+imap