Fingerprinting spam


mozillaman

Sep 9, 2006, 7:03:55 PM
to okopipi-dev
[I'm also posting this message here as it seems the forum (where this
was originally posted http://www.okopipi.org/forummsg/100 ) gets little
to no traffic. I'm interested in helping develop Okopipi, and
following are some of my ideas on fingerprinting spam, enjoy!]

/forummsg/99 got me thinking: how can we fingerprint spam effectively?
Just md5summing the entirety of it won't do, since many spams use a
"randomness" factor. So, here is a way I think we could accurately
fingerprint spam.

First, we take the subject, remove all "Re:", "Fwd:", etc. tags, and MD5
what's left. This becomes the first print.
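
A minimal sketch of the subject print, assuming Python. The list of tags to
strip and the lowercasing step are assumptions, not something the post
specifies, and subject_print is a hypothetical name.

```python
import hashlib
import re

# Leading reply/forward tags to strip. The exact tag list is an assumption.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)

def subject_print(subject: str) -> str:
    """Return the MD5 hex digest of a subject with leading Re:/Fwd: tags removed."""
    cleaned = _PREFIX.sub("", subject).strip().lower()  # lowercasing is an assumption
    return hashlib.md5(cleaned.encode("utf-8")).hexdigest()
```

With this, "Re: Fwd: Cheap pills" and "cheap pills" produce the same print.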

Then, we take the spam sample, remove all formatting (tags, whitespace,
everything except a stream of numbers and text) and we md5sum the first
250 chars. This becomes FingerPrint2.
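
The body print could look roughly like this. It assumes "formatting" means
HTML-style tags plus anything that is not an ASCII letter or digit, and it
lowercases the stream; both are assumptions, and body_print is a
hypothetical name.

```python
import hashlib
import re

def body_print(body: str, length: int = 250) -> str:
    """MD5 of the first `length` characters of the stripped-down body."""
    # Drop anything that looks like an HTML tag.
    no_tags = re.sub(r"<[^>]*>", "", body)
    # Keep only ASCII letters and digits (assumption: non-ASCII text is ignored).
    stream = re.sub(r"[^0-9A-Za-z]", "", no_tags).lower()
    return hashlib.md5(stream[:length].encode("ascii")).hexdigest()
```

Only the first 250 characters count, so random text appended to the end of a
long spam does not change the print, while random text near the start does.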

Next, we take all URLs in the email and remove anything that is obviously
junk, like multiple /////'s after a URL and ?parameters, so we end up
with a clean list of URLs. We then remove all duplicates, md5sum each
URL left, and make those FingerPrint3, 4, etc.
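
A sketch of the URL prints under some assumptions: only http/https links are
picked up, the host is lowercased, repeated slashes are collapsed, and the
query string and fragment are dropped. The regular expression is deliberately
naive and url_prints is a hypothetical name.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Naive URL matcher; a real extractor would need more care.
_URL = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def url_prints(text: str) -> list[str]:
    """Return one MD5 hex digest per distinct cleaned URL found in text."""
    cleaned = set()  # a set removes duplicates
    for raw in _URL.findall(text):
        parts = urlsplit(raw)
        # Collapse runs of slashes and drop trailing ones; query is discarded.
        path = re.sub(r"/+", "/", parts.path).rstrip("/")
        cleaned.add(f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}")
    return [hashlib.md5(u.encode("utf-8")).hexdigest() for u in sorted(cleaned)]
```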

Next, we take the "from" email address domain (eg hotmail.com), md5 it,
and make it a print.

Finally, we take the server routing information, and make each server
name the mail passed through a print.
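
The sender-domain and routing prints might be sketched as below with Python's
standard email package. It assumes the routing information comes from the
Received headers and that host names follow the words "from" and "by"; keep
in mind that spammers can forge Received headers. The function names are
hypothetical.

```python
import hashlib
import re
from email import message_from_string
from email.utils import parseaddr

def md5_hex(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def sender_and_route_prints(raw_message: str) -> list[str]:
    """Return the From-domain print followed by one print per relay host."""
    msg = message_from_string(raw_message)
    prints = []
    _, address = parseaddr(msg.get("From", ""))
    if "@" in address:
        prints.append(md5_hex(address.rsplit("@", 1)[1].lower()))
    # Assumption: each Received header reads "from <host> ... by <host> ...".
    servers = set()
    for header in msg.get_all("Received", []):
        for host in re.findall(r"\b(?:from|by)\s+([\w.-]+)", header):
            servers.add(host.lower())
    prints.extend(md5_hex(s) for s in sorted(servers))
    return prints
```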

Now, we end up with a list of "prints" for the spam message. We can
print every message this way and then check for matches of 65% and
greater (not sure how strict we want to be). So if two messages each
have 7 prints, and 6 of those match (about 86%), we treat them as part
of a single message chain.
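
The matching step can be sketched as follows. The post only covers the case
where both messages have the same number of prints, so dividing by the larger
print count is an assumption, as are the function names.

```python
def match_percent(prints_a, prints_b) -> float:
    """Percentage of prints shared by two messages."""
    a, b = set(prints_a), set(prints_b)
    if not a or not b:
        return 0.0
    # Assumption: divide by the larger set so extra prints lower the score.
    return 100.0 * len(a & b) / max(len(a), len(b))

def same_chain(prints_a, prints_b, threshold: float = 65.0) -> bool:
    """True when two messages match at or above the threshold."""
    return match_percent(prints_a, prints_b) >= threshold
```

For two messages with 7 prints each and 6 in common, match_percent gives
6/7, roughly 85.7, which clears the 65% threshold.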

We could also give some prints more importance than others. So, subject
has 1, body has 2, urls have 2, from addresses 3, and servers 4, so if
two messages match on a heavily weighted print (say, the same servers),
they'll be paired with a higher percentage.
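
A sketch of the weighted variant using the weights listed above. Representing
each print as a (kind, digest) pair and dividing by the heavier message's
total weight are assumptions made for illustration.

```python
# Weights as listed in the post; the kind names are hypothetical labels.
WEIGHTS = {"subject": 1, "body": 2, "url": 2, "from": 3, "server": 4}

def weighted_match_percent(prints_a, prints_b, weights=WEIGHTS) -> float:
    """Weighted share of matching prints; each print is a (kind, digest) pair."""
    a, b = set(prints_a), set(prints_b)

    def total(prints):
        return sum(weights[kind] for kind, _ in prints)

    denom = max(total(a), total(b))
    if denom == 0:
        return 0.0
    return 100.0 * total(a & b) / denom
```

With one print of each of subject, body, from, and server, two messages that
differ only in subject score 90%, while two that differ only in server score
60%.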

Robin Monks
RoBiN {At} GmKiNg [dOt] OrG

Juggernaut

Sep 10, 2006, 1:55:08 AM
to okopipi-dev
I agree with your approach; however, we will need to define a
higher-level logic / pattern search for mails.
The other day, I noticed two very similar-looking spams: they had
different senders, different routes, and different mail text (garbage
text), but the thing they had in common was an image. The image was
attached to each message and had exactly the same content (a hoax stock
alert msg).
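
One way to catch this case is to add a print per image attachment. The sketch
below hashes the decoded bytes of every image/* part using Python's standard
email package; image_prints is a hypothetical name. An exact hash only matches
byte-identical images, so it would miss images that have been altered even
slightly.

```python
import hashlib
from email import message_from_bytes

def image_prints(raw_message: bytes) -> list[str]:
    """Return one MD5 hex digest per distinct image attachment."""
    msg = message_from_bytes(raw_message)
    digests = set()
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            # decode=True undoes base64 / quoted-printable transfer encoding.
            payload = part.get_payload(decode=True)
            if payload:
                digests.add(hashlib.md5(payload).hexdigest())
    return sorted(digests)
```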

This is just my 2 cents, obviously we need more input and suggestions.

Cheers!
Jug.

Robin Monks

Sep 10, 2006, 2:19:10 PM
to okopi...@googlegroups.com
Responded on http://www.okopipi.org/forummsg/100#comment-325

Robin
--
Robin Monks,
CivicSpace Release Engineer - http://civicspacelabs.com
Drupal Marketing Coordinator - http://drupal.org
Encrypt! http://tinyurl.com/ffo3l - http://www.gpg4win.org/

Juggernaut

Sep 10, 2006, 3:28:51 PM
to okopipi-dev

Roger Filomeno

Sep 12, 2006, 1:11:40 AM
to okopi...@googlegroups.com
I couldn't access the forum (site timed out), so I'm posting my reply here.

Bayesian spam filtering worked[1] for me; it can be used to build either a whitelist filter or a blacklist filter. I still think it needs something extra, like the ability to actually follow the links included in a message, but that might be dangerous if a link contains XSS. I agree that image fingerprinting will be client-intensive, but it can be held in reserve in case the Bayesian filtering fails. It would also be better if the text in the image could be extracted, using the same methods used to defeat CAPTCHAs, so that the Bayesian filter can read it too.

[1]http://www.paulgraham.com/better.html
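
For readers unfamiliar with the idea, here is a very small naive Bayes token
classifier. It is a generic textbook sketch with add-one smoothing, not the
specific method described in [1], and the class and function names are
hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())

class TinyBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text: str) -> float:
        words = tokens(text)
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        total_docs = self.docs["spam"] + self.docs["ham"]
        score = {}
        for label in ("spam", "ham"):
            seen = sum(self.counts[label].values())
            # Add-one smoothing on both the prior and the token likelihoods.
            log_p = math.log((self.docs[label] + 1) / (total_docs + 2))
            for word in words:
                log_p += math.log((self.counts[label][word] + 1) / (seen + vocab))
            score[label] = log_p
        # Clamp the difference so math.exp cannot overflow on long messages.
        diff = max(-700.0, min(700.0, score["ham"] - score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))
```

Text recovered from an image could be passed to train and spam_probability
the same way as ordinary body text.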

just my 2cents.




--
--
MSG GODIE <YOUR MESSAGE>
then send to 2948 for Globe/Sun and 3940 for Smart. Get yours FREE at www.TxtDomain.com
--
Roger P. Filomeno
Mobile Specialist / R&D
Finger Apps Inc, http://fingerapps.com
Blog: http://corruptedpartition.blogspot.com/

Kevin Winter

Sep 12, 2006, 1:16:36 AM
to okopi...@googlegroups.com
On 9/12/06, Roger Filomeno <aliena...@gmail.com> wrote:
> might be dangerous if the link contain XSS. I agree that image
> finger-printing will be client intensive but it can be reserved in case the
> bayesian filtering fails. Also its better if the text on the image can be
> extracted using the same method in defeating captha's so that the bayesian
> filter can read it also.

Also, since Tesseract (the OCR engine originally developed at HP) has
been released as open source[1] with Google's backing, we could
probably decode most text stored as an image. Easier than captchas,
because the spammer _wants_ us to be able to easily read their
message.

[1] http://sourceforge.net/projects/tesseract-ocr

--
Open Source, Open Mind

Larry Vagina

Sep 24, 2006, 3:45:23 PM
to okopipi-dev
This is off-topic, but there is a new way of authenticating (e.g.,
digitally signing) emails that doesn't require recipients to sign up in
order to check messages, and signing up to sign your own email is free:
http://www.messagelevel.com/

It works with webmail, thunderbird, or any windows program that uses
pop3 and smtp. I don't see any mention of imap on their page, other
than here:
http://www.google.com/search?q=site%3Amessagelevel.com+imap
