Spam
Pascal Costanza  
Newsgroups: comp.lang.lisp
From: Pascal Costanza <costa...@web.de>
Date: Tue, 20 Jan 2004 17:33:52 +0100
Local: Tues, Jan 20 2004 11:33 am
Subject: [OT] Re: Spam

Tim Bradshaw wrote:
> * Raymond Wiker wrote:

>> SpamAssassin uses a number of filters, and adds up the scores
>> for these filters. Tuning and configuration of SpamAssassin is not
>> straightforward; I wonder if it would be possible to apply genetic
>> algorithms to find good combinations of filters? If so, then this
>> might be a near-perfect application for Common Lisp...

> I think the problem is that something has to score the results, and
> that requires a human to look at the spam and nonspam, and that
> defeats the object.  I don't get enough spam that I can't filter it
> fairly easily by eye, but I do get enough (and a high enough ratio of
> spam/nonspam) that I'm almost no longer willing to.

I think that Apple Mail and Mozilla/Thunderbird have managed to make
this a relatively unobtrusive process. In my case, it took about 4 weeks
of training Apple Mail, and then it was pretty reliable in
distinguishing spam from other mail.

The trick is to make the classification as simple as clicking the delete
button, so that it fits with your regular workflow. I wouldn't have
accepted a distinct classification step.

On the other hand, I receive quite a high number of spam mails, so this
may have made the process more effective.

Pascal

--
Tyler: "How's that working out for you?"
Jack: "Great."
Tyler: "Keep it up, then."


 
Ingvar Mattsson
Newsgroups: comp.lang.lisp
From: Ingvar Mattsson <ing...@cathouse.bofh.se>
Date: 20 Jan 2004 16:47:14 +0000
Local: Tues, Jan 20 2004 11:47 am
Subject: Re: [OT] Re: Spam

What I did for a while before "trusting" the classification by the
Bayesian filter was (a) classify saved non-spam mailboxes as "not
spam", then (b) move everything that was spam to a "spam" folder and,
after a few days, classify that as "spam". Slightly more than "just
click a button", but the extra work is the typing of two commands
(sa-learn --spam --dir ../spam && rm ../spam/*).

I could, I guess, have added a "Spam" and a "Ham" button to my mail
client and done it that way, but after having spent a whole 2-3
minutes, I decided that the amortised effort probably wasn't worth
it. As usual, your mileage may vary.

//Ingvar
--
(defmacro fakelambda (args &body body) `(labels ((me ,args ,@body)) #'me))
(funcall (fakelambda (a b) (if (zerop (length a)) b (format nil "~a~a"
 (aref a 0) (me b (subseq a 1))))) "Js nte iphce" "utaohrls akr")


 
Tage Stabell-Kulø
Newsgroups: comp.lang.lisp
From: t...@ifi.uit.no (Tage Stabell-Kulø)
Date: 20 Jan 2004 20:43:56 +0100
Local: Tues, Jan 20 2004 2:43 pm
Subject: Re: Spam

Tim Bradshaw <t...@cley.com> writes:
> I think the problem is that something has to score the results, and
> that requires a human to look at the spam and nonspam, and that
> defeats the object.

Before I returned to reading email in emacs, I used the Cloudmark
service embedded in Outlook.  Basically, it lets you mark email as
spam, and sends an MD5 of the email to them.  All mail marked by others
as spam is moved to the spam folder.

Or, in other words, only one single user will (have to) see each
individual spam.  On a normal day it would remove about 95% of all my
spam, and I'd do the remaining 5% as my part of the effort.  I receive
100+ spam per day.  I would go through my spam box every week before
deleting.  Not even once during my years as a user did I find that
Cloudmark had been mistaken!

Cloudmark works on the assumption that the body of spam is identical
for all users.  As long as it is, the approach will work if you have a
large enough user community.
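
A minimal sketch of that idea, assuming a central registry of reported digests. The `SharedSpamRegistry` class, the function names and the whitespace normalisation are all hypothetical and only illustrate the principle; this is not Cloudmark's actual protocol.

```python
import hashlib


def body_digest(body):
    """Return an MD5 hex digest of a message body.

    Collapsing whitespace first is an assumption of this sketch, so that
    trivial re-wrapping does not change the digest.
    """
    normalised = " ".join(body.split())
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


class SharedSpamRegistry:
    """Hypothetical stand-in for the central service: a set of digests."""

    def __init__(self):
        self._reported = set()

    def report(self, body):
        # One user marks the message as spam; only the digest is shared.
        self._reported.add(body_digest(body))

    def is_known_spam(self, body):
        # Every other user can now filter the same body without reading it.
        return body_digest(body) in self._reported
```

The weakness is the one named above: changing a single character of the body yields a different digest, so the scheme depends on spam bodies being identical across recipients.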

According to their homepage, sendmail.com has chosen their solution
as part of their commercial offering.  Unfortunately, nothing to be
bought from sendmail.com for NetBSD, so that's a no-go for me.

> --tim

 [TaSK@/\\]

 
Thomas F. Burdick
Newsgroups: comp.lang.lisp
From: t...@famine.OCF.Berkeley.EDU (Thomas F. Burdick)
Date: 20 Jan 2004 12:28:37 -0800
Local: Tues, Jan 20 2004 3:28 pm
Subject: Re: Spam

My problem is that it's pretty easy to get SpamAssassin to recognize
spam, but even with years of legit email to work with, I can't get a
setup that doesn't mark large amounts of my legit email as spam.  I
was able to get SA to mark only about 2% of my legit email as spam,
and let through about 50% of the spam I get, but it involved filtering
for viruses and known-good senders first.  So, I was able to get
things set up so I only wade through the amount of spam I did a year
ago.

I'm guessing that the writers of spam filters mostly correspond with
people who write in more-or-less complete sentences.  But I don't want
to stop writing to my friends who sometimes send me emails like Yo
son, CHECK THIS SHIT OUT $$$ <url> fuk dat track is HOT!!

--
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                              
   |     ) |                              
  (`-.  '--.)                              
   `. )----'                              


 
Discussion subject changed to "Spam [Was: Newbie: floating point optimization]" by Thomas A. Russ
Thomas A. Russ
Newsgroups: comp.lang.lisp
From: t...@sevak.isi.edu (Thomas A. Russ)
Date: 20 Jan 2004 11:52:24 -0800
Local: Tues, Jan 20 2004 2:52 pm
Subject: Re: Spam [Was: Newbie: floating point optimization]

  There are still spammers who send mail to message-IDs dating back to
  1994, and the SGML mailboxes continue to receive lots and lots of junk
  despite being discontinued in 1996.  It is about one third of what my
  1987-vintage University of Oslo address receives, but still, I'm quite
  impressed with the persistence of these drooling idiots.

Actually, this doesn't seem all that surprising, given some of the
economics of the spam industry.  The upstream suppliers like to sell
EMail address lists with the largest number of addresses in them, so
they really have absolutely NO incentive to try to cull addresses once
they have been added.  As far as I know, there aren't any rating groups
out there who analyze the degree of usefulness of those address lists :)

--
Thomas A. Russ,  USC/Information Sciences Institute


 
Discussion subject changed to "Spam" by Petter Gustad
Petter Gustad
Newsgroups: comp.lang.lisp
From: Petter Gustad <newsmailco...@gustad.com>
Date: 20 Jan 2004 22:58:38 +0100
Local: Tues, Jan 20 2004 4:58 pm
Subject: Re: Spam

Tim Bradshaw <t...@cley.com> writes:
> Me too.  I've struggled with a Bayesian thing but it really doesn't
> cope: I think it is failing because there isn't enough good mail to
> train it on (I get hardly any `real' mail: probably 1% of my mail is

Maybe you could use comp.lang.lisp to train it?

I'm using mew (an emacs mh mail client) and it has a very handy
command to train spam (ls) and ham (lh) in conjunction with bogofilter
(Bayesian). However, the problem lately is the high ratio of Bayesian
poison that I get (which I just file in my spam folder without any
further training).

Are there any good filters for detecting the Bayesian-poison nonsense?

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?


 
Erann Gat
Newsgroups: comp.lang.lisp
From: gNOSPA...@jpl.nasa.gov (Erann Gat)
Date: Tue, 20 Jan 2004 14:33:24 -0800
Local: Tues, Jan 20 2004 5:33 pm
Subject: Re: Spam
In article <87ektus2nl....@filestore.home.gustad.com>, Petter Gustad
<newsmailco...@gustad.com> wrote:
> Are there any good filters for detecting the Bayesian-poison nonsense?

I haven't actually tried this, but all the Bayes-poison I've gotten is
just random words, so I would think it would succumb easily to a histogram
test.  I don't think you'd have to get very sophisticated.  Just looking
at the ratio of short words (<= 3 characters) to the total number of words
seems like it ought to work.  I predict this ratio would be around 0.1 for
non-spam, and <0.01 for Bayes-poison spam.
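
A minimal sketch of such a histogram test. The tokenising regular expression and the function name are assumptions of this sketch, and the 0.1 and 0.01 figures above are untested predictions, so no threshold is hard-coded here.

```python
import re

# Assumption: a "word" is a run of letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def short_word_ratio(text, max_len=3):
    """Fraction of words in `text` that have at most `max_len` characters."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0  # no words at all: treat as "no short words"
    short = sum(1 for w in words if len(w) <= max_len)
    return short / len(words)
```

Running it over a sample of real ham and real poisoned spam would show whether the two ratios are actually far enough apart to pick a cut-off.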

E.


 
Tim Bradshaw
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: Tue, 20 Jan 2004 23:34:30 +0000
Local: Tues, Jan 20 2004 6:34 pm
Subject: Re: Spam
* Thomas F Burdick wrote:

> My problem is that it's pretty easy to get SpamAssassin to recognize
> spam, but even with years of legit email to work with, I can't get a
> setup that doesn't mark large amounts of my legit email as spam.  

Right, that's my problem exactly.  In fact it's not really surprising,
since well over 99% of my mail *is* spam (or was in week 3 of 2004,
anyway), so it essentially doesn't see any good mail at all.  I think
that this is what will doom bayesian things (and possibly any
statistical approach): statistically *all* mail will be spam quite
soon (and it is already for some people) so any training data they get
will be hopelessly skewed.  Although someone with a better grasp of
statistics than me can probably demonstrate that this is OK in fact.

--tim


 
Tim Lavoie
Newsgroups: comp.lang.lisp
From: Tim Lavoie <tool_...@spamcop.net>
Date: Tue, 20 Jan 2004 17:56:09 -0600
Local: Tues, Jan 20 2004 6:56 pm
Subject: Re: Spam

>>>>> "Erann" == Erann Gat <gNOSPA...@jpl.nasa.gov> writes:

    Erann> In article <87ektus2nl....@filestore.home.gustad.com>,
    Erann> Petter Gustad
    Erann> <newsmailco...@gustad.com> wrote:

    >> Are there any good filters for detecting the Bayesian-poison
    >> nonsense?
    >>

    Erann> I haven't actually tried this, but all the Bayes-poison
    Erann> I've gotten is just random words, so I would think it would
    Erann> succumb easily to a histogram test.  I don't think you'd
    Erann> have to get very sophisticated.  Just looking at the ratio
    Erann> of short words (<= 3 characters) to the total number of
    Erann> words seems like it ought to work.  I predict this ratio
    Erann> would be around 0.1 for non-spam, and <0.01 for
    Erann> Bayes-poison spam.

Better yet, just use the filing of good or bad as further training
data. I get the occasional escapee with poisoning attempts, but it
just gets sent to the spam pile, and trained as spam for next time.

Even with poisoning, I get relatively little delivered to my main
mailbox, so the filtering is doing well. (spambayes, if it matters)


 
Thomas F. Burdick
Newsgroups: comp.lang.lisp
From: t...@famine.OCF.Berkeley.EDU (Thomas F. Burdick)
Date: 20 Jan 2004 16:12:38 -0800
Local: Tues, Jan 20 2004 7:12 pm
Subject: Re: Spam

Tim Lavoie <tool_...@spamcop.net> writes:
> Better yet, just use the filing of good or bad as further training
> data. I get the occasional escapee with poisoning attempts, but it
> just gets sent to the spam pile, and trained as spam for next time.

> Even with poisoning, I get relatively little delivered to my main
> mailbox, so the filtering is doing well. (spambayes, if it matters)

The problem is, you're training it to *recognize* the poison, but not
necessarily to distinguish it from less-than-coherent legit email.

--
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                              
   |     ) |                              
  (`-.  '--.)                              
   `. )----'                              


 
Harald Hanche-Olsen
Newsgroups: comp.lang.lisp
From: Harald Hanche-Olsen <han...@math.ntnu.no>
Date: 21 Jan 2004 00:40:48 +0100
Local: Tues, Jan 20 2004 6:40 pm
Subject: Re: Spam
+ Tim Bradshaw <t...@cley.com>:

| Me too.  I've struggled with a Bayesian thing but it really doesn't
| cope: I think it is failing because there isn't enough good mail to
| train it on (I get hardly any `real' mail: probably 1% of my mail is
| real), so it essentially classifies everything as spam, which I then
| have to wade through.  I may try feeding it my whole mailbox as good
| to give it some more data to learn from, but that's only a couple of
| days worth of spam now, so I'm not sure if it will work.

I have good experiences with spambayes.  When I first set it up, I
trained it on about 1000 hams and several thousand spams that I had
squirreled away for just such an opportunity.  Then I tested it on all
the mail I had just trained it on, and discovered about a dozen
messages misclassified in each category.  Now, my daily haul is
roughly 300 spam, 10 ham, and 3 unsure.  I hand classify the unsure
ones and train on them, otherwise I do no more training.  In eight
months of operation I have had very few hams misclassified as spam.  But then
I am perhaps lucky, living in the academic world.  Those who live in
the business world and need to distinguish between sales pitches that
are not spam and those that are, may have less luck.

--
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- Debating gives most of us much more psychological satisfaction
  than thinking does: but it deprives us of whatever chance there is
  of getting closer to the truth.  -- C.P. Snow


 
Tim Bradshaw
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: Tue, 20 Jan 2004 23:57:28 +0000
Local: Tues, Jan 20 2004 6:57 pm
Subject: Re: Spam

* Erann Gat wrote:
> I haven't actually tried this, but all the Bayes-poison I've gotten is
> just random words, so I would think it would succumb easily to a histogram
> test.  I don't think you'd have to get very sophisticated.  Just looking
> at the ratio of short words (<= 3 characters) to the total number of words
> seems like it ought to work.  I predict this ratio would be around 0.1 for
> non-spam, and <0.01 for Bayes-poison spam.

I presume that what will happen is an arms race here: the obvious
approach to dealing with this kind of random-word stuff is to look at
bigram (or in general n-gram) statistics.  I'm fairly sure that the
bigram stats of the kind of bayesian-poison stuff I see are nothing
like English.  So something that does bigrams should easily be able to
distinguish.  But unfortunately it's just as easy to write things
which *generate* text using bigram statistics from a bunch of input
data, and of course that output will look exactly like English to a
bigram-stats filter.  Anything longer than bigrams starts needing lots
of training data I think.

(Or indeed: Input data I think.  Gram statistics; Just as easy to a
bigram stats filter.  The bigram stats filter.  Obvious approach to a
bigram statistics from a bigram statistics from a bigram statistics;
I'm fairly sure that does bigrams starts needing lots of the kind of
bayesian poison stuff I see is nothing like English.)
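
A minimal sketch of how cheap such a generator is, assuming whitespace-separated words as tokens. The function names are illustrative only. Every adjacent word pair in the output also occurs in the input, which is why a filter that only checks bigrams cannot tell the two apart.

```python
import random
from collections import defaultdict


def bigram_table(words):
    """Map each word to the list of words that followed it in the input."""
    table = defaultdict(list)
    for first, second in zip(words, words[1:]):
        table[first].append(second)
    return table


def generate(table, start, length, rng):
    """Random walk over the table, starting from `start`.

    Repeated followers stay in the lists, so frequent bigrams are chosen
    more often, roughly matching the input's bigram statistics.
    """
    out = [start]
    while len(out) < length:
        followers = table.get(out[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        out.append(rng.choice(followers))
    return out
```

For example, `" ".join(generate(bigram_table(text.split()), "the", 40, random.Random()))` produces text in the style of the parenthesised paragraph above.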

--tim


 
Pascal Bourguignon
Newsgroups: comp.lang.lisp
From: Pascal Bourguignon <s...@thalassa.informatimago.com>
Date: 21 Jan 2004 06:44:15 +0100
Local: Wed, Jan 21 2004 12:44 am
Subject: Re: Spam

I've done  a test with  bigrams and trigrams:  the gain to  bigram was
sizeable, but from bigrams to  trigrams, there was not much difference
on the data set I used (with  mailboxes from 1000 to 4000 spam or good
messages).

--
__Pascal_Bourguignon__                     http://www.informatimago.com/
There is no worse tyranny than to force a man to pay for what he doesn't
want merely because you think it would be good for him.--Robert Heinlein
http://www.theadvocates.org/


 
Erik Naggum
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 21 Jan 2004 06:42:36 +0000
Local: Wed, Jan 21 2004 1:42 am
Subject: Re: Spam
* Tim Bradshaw
| (Or indeed: Input data I think.  Gram statistics; Just as easy to a
| bigram stats filter.  The bigram stats filter.  Obvious approach to a
| bigram statistics from a bigram statistics from a bigram statistics;
| I'm fairly sure that does bigrams starts needing lots of the kind of
| bayesian poison stuff I see is nothing like English.)

  The natural extension of this method is to use the spell checker on
  incoming mail.

--
Erik Naggum | Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.


 
Matthias
Newsgroups: comp.lang.lisp
From: Matthias <n...@spam.pls>
Date: 21 Jan 2004 10:25:50 +0100
Local: Wed, Jan 21 2004 4:25 am
Subject: Re: Spam

Tim Bradshaw <t...@cley.com> writes:
> * Thomas F Burdick wrote:

> > My problem is that it's pretty easy to get SpamAssassin to recognize
> > spam, but even with years of legit email to work with, I can't get a
> > setup that doesn't mark large amounts of my legit email as spam.  

> Right, that's my problem exactly.  In fact it's not really surprising,
> since well over 99% of my mail *is* spam (or was in week 3 of 2004,
> anyway), so it essentially doesn't see any good mail at all.  I think
> that this is what will doom bayesian things (and possibly any
> statistical approach): statistically *all* mail will be spam quite
> soon (and it is already for some people) so any training data they get
> will be hopelessly skewed.  Although someone with a better grasp of
> statistics than me can probably demonstrate that this is OK in fact.

The concept you might be looking for is "extreme value statistics"
which is the statistics of rare events.  It is used by insurance
companies to model events like earthquakes and hurricanes.  Extreme
value statistics is often hard to do, because you need to model the
tails of distributions accurately.  I'm no expert here, so I can't
tell more.  (But google has lots of references.)

So far statistical methods seem relatively successful approaches to
model human language (compared to the alternatives).  The problem with
Bayesian filters is not that they are using Bayes' rule but that,
currently, their model of human language is crude (individual words
are assumed statistically independent).
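
A minimal sketch of that crude model, to make the independence assumption concrete. The class, the add-one smoothing and the choice to score only the words present in a message are simplifying assumptions of this sketch, not how Bogofilter or SpamAssassin are implemented.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy word-presence classifier with labels "spam" and "ham"."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, text):
        self.docs[label] += 1
        # Count each word once per message.
        self.counts[label].update(set(text.lower().split()))

    def spam_log_odds(self, text):
        # Prior log-odds from how many messages of each class were seen.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in set(text.lower().split()):
            p_spam = (self.counts["spam"][word] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.docs["ham"] + 2)
            # Independence assumption: each word's evidence is simply added,
            # regardless of which other words appear next to it.
            log_odds += math.log(p_spam / p_ham)
        return log_odds
```

A positive result leans towards spam, a negative one towards ham. Word order never enters the score, which is the crudeness referred to above.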

BTW: For me, the current language model works fine: My mailbox
receives approx. 1000 emails a month, 2/3 of which are spam.  Bogofilter
does a very nice job with filtering (much better than spamassassin's
set of rules).  OTOH, I don't trust it enough to put my e-mail address
on usenet. ;)


 
Ingvar Mattsson
Newsgroups: comp.lang.lisp
From: Ingvar Mattsson <ing...@cathouse.bofh.se>
Date: 21 Jan 2004 10:32:50 +0000
Local: Wed, Jan 21 2004 5:32 am
Subject: Re: Spam

Grave mismatch between the text/plain and the text/html non-marked-up
text? That takes care of stuff with "You have received a mail in HTML"
too (if people can't write plain text emails, I don't want them).
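
A minimal sketch of such a mismatch check using Python's standard `email` package. The tag-stripping regex is deliberately crude, the Jaccard word overlap is just one possible measure, and what counts as a "grave" mismatch is left open; all names here are illustrative assumptions.

```python
import re
from email.message import EmailMessage
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")       # crude: does not handle comments or scripts
WORD_RE = re.compile(r"[a-z0-9']+")


def part_text(msg, content_type):
    """Return the decoded text of the first part with the given content type."""
    for part in msg.walk():
        if part.get_content_type() == content_type:
            raw = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            return raw.decode(charset, errors="replace")
    return ""


def words(text):
    return set(WORD_RE.findall(text.lower()))


def plain_html_overlap(msg):
    """Jaccard overlap between the words of the text/plain and text/html parts.

    Returns None when either part is missing or empty, so there is
    nothing to compare.
    """
    plain = words(part_text(msg, "text/plain"))
    html = words(unescape(TAG_RE.sub(" ", part_text(msg, "text/html"))))
    if not plain or not html:
        return None
    return len(plain & html) / len(plain | html)
```

An overlap near 1.0 means both parts say the same thing; an overlap near 0.0 is the case described above, such as a plain part that only says "You have received a mail in HTML".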

//Ingvar
--
(defun p(i d)(cond((not i)(terpri))((car i)(let((l(cadr i))(d(nthcdr(car i)d
)))(princ(elt(string(car d))l))(p(cddr i)d)))(t(princ #\space)(p(cdr i)d))))
(p'(76 2 1 3 1 4 1 6()0 5()16 10 0 7 0 8 0 9()2 6 0 0 12 4 23 4 1 4 8 8)(sort
(loop for x being the external-symbols in :cl collect (string x)) #'string<))


 
Tim Bradshaw
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: Wed, 21 Jan 2004 10:48:36 +0000
Local: Wed, Jan 21 2004 5:48 am
Subject: Re: Spam

* Pascal Bourguignon wrote:
> I've done  a test with  bigrams and trigrams:  the gain to  bigram was
> sizeable, but from bigrams to  trigrams, there was not much difference
> on the data set I used (with  mailboxes from 1000 to 4000 spam or good
> messages).

This is very likely because you don't have enough data to get
reasonable trigram stats: you really need a lot.

--tim


 
Tim Bradshaw
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: Wed, 21 Jan 2004 10:59:57 +0000
Local: Wed, Jan 21 2004 5:59 am
Subject: Re: Spam

* Erik Naggum wrote:
>   The natural extension of this method is to use the spell checker on
>   incoming mail.

I think I may be confused by what people mean by `Bayesian poison'.
Is it all the stuff which has non-words in it, like v1agra &c?  In
that case I think that something which had a decent sized dictionary
and looked for more than n% known words would probably be good.  
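
A minimal sketch of that dictionary check. The tokeniser, the function names and the 0.8 default are assumptions of this sketch; the text above leaves n unspecified.

```python
import re

# Letters and digits stay together, so "v1agra" is one (unknown) token.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def known_word_fraction(text, dictionary):
    """Fraction of tokens in `text` found in `dictionary` (a set of lowercase words)."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0  # nothing recognisable at all
    known = sum(1 for t in tokens if t in dictionary)
    return known / len(tokens)


def mostly_known_words(text, dictionary, min_fraction=0.8):
    """True if at least `min_fraction` of the tokens are dictionary words."""
    return known_word_fraction(text, dictionary) >= min_fraction
```

This catches obfuscated spellings only; a message built entirely from real dictionary words passes it trivially.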

The stuff I was worrying about was things which have lots of random,
English, words in, possibly with correct single-word stats, so
anything that uses a completely naive single-word model of language
will assume it's English, but the bigram stats will be completely
mutant.  A spelling checker doesn't help with this, I think.  I also
think an ngram thing can be fooled because it's so easy to generate.

Another approach might be to use a PoS tagger, and then look at the
nouns to see what *they* look like (or maybe other parts of speech).

But even here, I don't know.  I have seen (or possibly dreamed) stuff
which has great chunks of out-of-copyright novels in, but is
constructed such that if you look at it with whatever tool it's aimed
at, you don't see that (sometimes because it's comments, but sometimes
because it's white text on a white background). So a tool needs to be
able to work out what IE would do with it, which is a hideous
problem. (Actually, it's an easy problem: if it's not plain text, it's
spam.  So spam has essentially destroyed any kind of rich content in
email).

--tim


 
Espen Vestre
Newsgroups: comp.lang.lisp
From: Espen Vestre <espen@*do-not-spam-me*.vestre.net>
Date: Wed, 21 Jan 2004 12:46:08 +0100
Local: Wed, Jan 21 2004 6:46 am
Subject: Re: Spam

Tim Bradshaw <t...@cley.com> writes:
> I think I may be confused by what people mean by `Bayesian poison'.
> Is it all the stuff which has non-words in it, like v1agra &c?

Hmm? I thought it was all the stuff that had real words in it?

Here's a cut and paste from a recently received spam:

"album howe vindicate arabic deafen decay twiddle hierarchy smalley
backlash luminous midwestern bivariate abstracter kinshasha negotiable
lawbreaker centerline backspace cranky struck consumption"

--
  (espen)


 
Christophe Rhodes
Newsgroups: comp.lang.lisp
From: Christophe Rhodes <cs...@cam.ac.uk>
Date: Wed, 21 Jan 2004 12:04:44 +0000
Local: Wed, Jan 21 2004 7:04 am
Subject: Re: Spam

Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:
> Tim Bradshaw <t...@cley.com> writes:

>> I think I may be confused by what people mean by `Bayesian poison'.
>> Is it all the stuff which has non-words in it, like v1agra &c?

> Hmm? I thought it was all the stuff that had real words in it?

> Here's a cut and paste from a recently received spam:

> "album howe vindicate arabic deafen decay twiddle hierarchy smalley
> backlash luminous midwestern bivariate abstracter kinshasha negotiable
> lawbreaker centerline backspace cranky struck consumption"

And my impression was that this isn't designed to defeat Bayesian spam
detectors, because the above words in any individual's corpus of ham
will feature very rarely if at all.

A reasonable assumption might be that one or two of these words have
appeared in a small number of ham messages.  They will therefore
contribute a small amount to the haminess of the message.  However,
the payload will generally have many strong indications of spam, and
consequently, at least here, the message is still filed as spam.

This filing as spam, in addition, retrains the Bayesian filter, such
that e.g. "hierarchy" becomes a slightly less strong indication of
ham.  So be it.  If "hierarchy" is a commonly-used word in your ham
(as it has been for me, as I did some work on the cosmological
hierarchy problem -- which incidentally means that "heirarchy" is also
a ham word for me... :-) then this slight retraining has no effect on
truly ham messages, which are still classified as ham; nor does it
have any effect on the next set of random words in spam, which is
unlikely to use "hierarchy" again.  So all in all, these random words
don't seem designed to defeat adaptive Bayesian filters.

What they do defeat, of course, is a SpamAssassin rule such as one
matching a MIME message with no text/plain component, without
triggering a rule detecting obvious spam words.  If I were attempting
to defeat Bayesian filters, more common words would seem more likely
to act as poison.

Christophe
--
http://www-jcsu.jesus.cam.ac.uk/~csr21/       +44 1223 510 299/+44 7729 383 757
(set-pprint-dispatch 'number (lambda (s o) (declare (special b)) (format s b)))
(defvar b "~&Just another Lisp hacker~%")    (pprint #36rJesusCollegeCambridge)


 
Pascal Bourguignon
Newsgroups: comp.lang.lisp
From: Pascal Bourguignon <s...@thalassa.informatimago.com>
Date: 21 Jan 2004 13:09:43 +0100
Local: Wed, Jan 21 2004 7:09 am
Subject: Re: Spam

Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:
> Tim Bradshaw <t...@cley.com> writes:

> > I think I may be confused by what people mean by `Bayesian poison'.
> > Is it all the stuff which has non-words in it, like v1agra &c?

> Hmm? I thought it was all the stuff that had real words in it?

> Here's a cut and paste from a recently received spam:

> "album howe vindicate arabic deafen decay twiddle hierarchy smalley
> backlash luminous midwestern bivariate abstracter kinshasha negotiable
> lawbreaker centerline backspace cranky struck consumption"

Strangely enough, none of these words appear in my normal email.
Perhaps if I was a nuclear physicist, "decay" would appear sometimes...

--
__Pascal_Bourguignon__                     http://www.informatimago.com/
There is no worse tyranny than to force a man to pay for what he doesn't
want merely because you think it would be good for him.--Robert Heinlein
http://www.theadvocates.org/


 
Hannah Schroeter  
Newsgroups: comp.lang.lisp
From: han...@schlund.de (Hannah Schroeter)
Date: 21 Jan 2004 13:30:17 GMT
Local: Wed, Jan 21 2004 8:30 am
Subject: Re: Spam
Hello!

Espen Vestre  <espen@*do-not-spam-me*.vestre.net> wrote:

>[...]
>"album howe vindicate arabic deafen decay twiddle hierarchy smalley
>backlash luminous midwestern bivariate abstracter kinshasha negotiable
>lawbreaker centerline backspace cranky struck consumption"

In fact, in my filter database, those words would end up as
spamwords, too, as they are selected more or less at random, while my
hamwords are of course adapted to the mail traffic *I* individually
receive (largely German words, though in both English and German,
biased by my interests).

And even if every piece of spam contains different such words,
they are at least not strongly "ham", and don't influence the
final decision, as that is drawn from the most significant
words.
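A minimal sketch of that last point, assuming a Graham-style filter: each
word has a spam probability, unknown words get a mildly ham-leaning default,
and only the n probabilities farthest from 0.5 are combined. The table, the
default of 0.4, and the function names are illustrative assumptions, not
taken from any particular filter.

```python
def most_significant(probs, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combined_spam_probability(probs):
    """Naive Bayes combination of per-word spam probabilities.

    Assumes every probability is strictly between 0 and 1.
    """
    spam = ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


def score(words, word_prob, unknown=0.4, n=15):
    """Score a message from its n most significant words only."""
    probs = [word_prob.get(w, unknown) for w in words]
    return combined_spam_probability(most_significant(probs, n))


# Hypothetical per-word spam probabilities learned from one user's mail.
WORD_PROB = {"viagra": 0.99, "mortgage": 0.95, "lisp": 0.01}

RANDOM_WORDS = ("album howe vindicate arabic deafen decay twiddle hierarchy "
                "smalley backlash luminous midwestern bivariate abstracter "
                "kinshasha negotiable lawbreaker centerline backspace cranky "
                "struck consumption").split()

MESSAGE = ["viagra", "mortgage"] + RANDOM_WORDS
```

With a small n the two strongly spammy words dominate and the score stays
close to 1. Only if every word were counted would the 22 unknown words,
each leaning slightly towards ham, pull the score below 0.5.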

Kind regards,

Hannah.


 
Joe Marshall  
Newsgroups: comp.lang.lisp
From: Joe Marshall <j...@ccs.neu.edu>
Date: Wed, 21 Jan 2004 10:47:34 -0500
Local: Wed, Jan 21 2004 10:47 am
Subject: Re: Spam

Tim Bradshaw <t...@cley.com> writes:
> I presume that what will happen is an arms race here: the obvious
> approach to dealing with this kind of random-word stuff is to look at
> bigram (or in general n-gram) statistics.  

It depends on the hack they are using.

Some spam I get has random character strings in it like this:

    Bad Credit is OK Gold Visa Card rjkarcmrosfn f

others have pseudowords:

    astrologum Palatioque big chance

some have random words:

    Authors, you decide ........picosecond
    why wait?      ...................       dakar

some seem to take a sample of canned text:

    Rolex-Italian crafted from $65-$65-$140 Free
    Ship<!--extravagance. And when did you arrive here? inquired
    she. -->

and some seem to sample text from other sources:

    Bechtel nod to L&T raises Indian hopes on Iraq projects

I have a regular expression that matches non-English trigrams.  It
finds the random character strings without a problem.  Pseudowords
tend to have non-English trigrams in them (for instance the trigram
"ioq" is not common in English).  The other stuff gets more difficult.

The random words should be relatively easy to catch statistically.
While I might actually talk about picoseconds, the bulk of the words in
the English language are not ones I commonly use.  (How often do you
write about hippodromes, speakeasies, and peristalses?)
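The trigram check can be sketched roughly as follows. This is not the
regular expression described above: it swaps in a set lookup against
trigrams collected from a reference text, which is easier to show in a few
lines. The reference text is a tiny hypothetical sample; a real filter would
collect trigrams from a large body of legitimate mail, otherwise nearly
every unfamiliar word is flagged.

```python
import re

# Hypothetical reference text standing in for a large corpus of real mail.
REFERENCE_TEXT = """
bad credit is ok gold visa card big chance authors you decide why wait
and when did you arrive here inquired she the quick brown fox jumps over
the lazy dog
"""


def trigrams(word):
    """Return the set of three-letter substrings of a lowercase word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def known_trigrams(text):
    """Collect every trigram seen inside the alphabetic words of text."""
    seen = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        seen |= trigrams(word)
    return seen


KNOWN = known_trigrams(REFERENCE_TEXT)


def suspicious_words(message, known=KNOWN):
    """List words containing at least one trigram absent from the reference."""
    return [w for w in re.findall(r"[a-z]+", message.lower())
            if trigrams(w) - known]
```

Words shorter than three letters have no trigrams and are never flagged, so
the stray "f" at the end of the first sample line passes unnoticed.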


 
Joe Marshall  
Newsgroups: comp.lang.lisp
From: Joe Marshall <j...@ccs.neu.edu>
Date: Wed, 21 Jan 2004 11:00:08 -0500
Local: Wed, Jan 21 2004 11:00 am
Subject: Re: Spam

Tim Bradshaw <t...@cley.com> writes:
> Me too.  I've struggled with a Bayesian thing but it really doesn't
> cope: I think it is failing because there isn't enough good mail to
> train it on (I get hardly any `real' mail: probably 1% of my mail is
> real), so it essentially classifies everything as spam, which I then
> have to wade through.  I may try feeding it my whole mailbox as good
> to give it some more data to learn from, but that's only a couple of
> days worth of spam now, so I'm not sure if it will work.

A pre-filter before the Bayesian filter is helpful.  Bayesian filters
tend to `latch on' to things that are easy to find statistically to
the detriment of those things that are more difficult.  There is no
need to train the filter to remove email from the .BIZ domain, or
email that claims to be a reply, but doesn't have a `References'
header.

In order to get the best results you need to have a model of spam and
ham.  It would be worthwhile to determine how the various popular spam
engines work so as to detect the engine itself rather than the message
within the spam.  For instance, one spam engine generates random
pseudohtml:  <oeaun><snauthsbm>

Another inserts text in the middle of words:  V<!-- oblong -->ia<!--
interest -->gra

You don't need to statistically match these, just detect them.
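A rough sketch of such a pre-filter, covering the four checks mentioned in
this post. The tag whitelist, the check names, and the function name are
hypothetical, and the sketch only looks at non-multipart bodies.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical whitelist; any other tag name is treated as pseudo-HTML.
KNOWN_TAGS = {"a", "b", "body", "br", "div", "font", "head", "html", "i",
              "img", "p", "span", "table", "td", "title", "tr"}

# An HTML comment with a letter immediately before and after it.
COMMENT_IN_WORD = re.compile(r"[A-Za-z]<!--.*?-->[A-Za-z]", re.DOTALL)
# Opening or closing tags; comments ("<!--") do not match.
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def prefilter(raw_message):
    """Return the names of the hard-coded checks that the message trips."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart bodies are ignored here
        body = ""
    hits = []
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender.endswith(".biz"):
        hits.append("biz-sender")
    subject = msg.get("Subject", "")
    if subject.lower().startswith("re:") and msg.get("References") is None:
        hits.append("reply-without-references")
    if COMMENT_IN_WORD.search(body):
        hits.append("comment-inside-word")
    if any(name.lower() not in KNOWN_TAGS for name in TAG_NAME.findall(body)):
        hits.append("pseudo-html-tag")
    return hits
```

Anything that trips a check can be dropped before it reaches the Bayesian
filter, so the statistics are trained only on the mail that is hard to
classify.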


 
Ray Dillinger  
Newsgroups: comp.lang.lisp
From: Ray Dillinger <b...@sonic.net>
Date: Wed, 21 Jan 2004 17:33:48 GMT
Local: Wed, Jan 21 2004 12:33 pm
Subject: Re: Spam

Matthias wrote:

> statistical methods seem relatively successful approaches to
> model human language (compared to the alternatives).  The problem with
> Bayesian filters is not that they are using Bayes' rule but that,
> currently, their model of human language is crude (individual words
> are assumed statistically independent).

> BTW: For me, the current language model works fine: My mailbox
> receives approx 1000 emails a month, 2/3 of which are spam.  Bogofilter
> does a very nice job with filtering (much better than SpamAssassin's
> set of rules).  OTOH, I don't trust it enough to put my e-mail address
> on usenet. ;)

These days I run a few simple checks on my mail -- basically
whitelists for known senders -- and then hand it over to ASSP's
Bayesian filter.

ASSP, and several other new-generation filters, build a spamdb
of word pairs rather than individual words.  It makes the database
bigger, but hard drive space is cheap.
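A minimal sketch of a word-pair database, to show the idea rather than how
ASSP actually stores its spamdb. The tokenizer, class name, and method names
are assumptions made for illustration.

```python
import re
from collections import Counter


def word_pairs(text):
    """Return adjacent lowercase word pairs, e.g. 'gold visa', 'visa card'."""
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


class PairDB:
    """Counts how often each word pair was seen in spam and in ham."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        """Add every word pair of text to the spam or the ham counts."""
        (self.spam if is_spam else self.ham).update(word_pairs(text))

    def spam_ratio(self, pair):
        """Fraction of sightings of pair that were in spam, or None if unseen."""
        s, h = self.spam[pair], self.ham[pair]
        return s / (s + h) if s + h else None
```

A message of n words yields n - 1 pairs, and the number of distinct pairs
grows much faster than the number of distinct words, which is where the
larger database comes from.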

Some links:  

http://spamprobe.sourceforge.net/
http://assp.sourceforge.net/

The only thing that some people might dislike about ASSP is that if
someone uses a free service that attaches a lot of spam to every
message that comes through (like Yahoo), it will treat their messages
as spam.  I am happy with this behavior and do not correct it.

I use ASSP.  I've had the same email address for ten years.  I post
to usenet using it.  'nuff said?  

                                Bear


 