>> SpamAssassin uses a number of filters, and add up the scores >>for these filters. Tuning and configuration of SpamAssassin is not >>straightforward; I wonder if it would be possible to apply genetic >>algorithms to find good combinations of filters? If so, then this >>might be a near-perfect application for Common Lisp...
> I think the problem is that something has to score the results, and > that requires a human to look at the spam and nonspam, and that > defeats the object. I don't get enough spam that I can't filter it > fairly easily by eye, but I do get enough (and a high enough ratio of > spam/nonspam) that I'm almost no longer willing to.
I think that Apple Mail and Mozilla/Thunderbird have managed to make this a relatively unobtrusive process. In my case, it took about 4 weeks of training Apple Mail, and then it was pretty reliable in distinguishing spam from other mail.
The trick is to make the classification as simple as clicking the delete button, so that it fits with your regular workflow. I wouldn't have accepted a distinct classification step.
On the other hand, I receive quite a high number of spam mails, so this may have made the process more effective.
Pascal
-- Tyler: "How's that working out for you?" Jack: "Great." Tyler: "Keep it up, then."
Pascal Costanza <costa...@web.de> writes: > Tim Bradshaw wrote: > > * Raymond Wiker wrote:
> >> SpamAssassin uses a number of filters, and add up the scores > >>for these filters. Tuning and configuration of SpamAssassin is not > >>straightforward; I wonder if it would be possible to apply genetic > >>algorithms to find good combinations of filters? If so, then this > >>might be a near-perfect application for Common Lisp... > > I think the problem is that something has to score the results, and > > that requires a human to look at the spam and nonspam, and that > > defeats the object. I don't get enough spam that I can't filter it > > fairly easily by eye, but I do get enough (and a high enough ratio of > > spam/nonspam) that I'm almost no longer willing to.
> I think that Apple Mail and Mozilla/Thunderbird have managed to make > this a relatively unobtrusive process. In my case, it took about 4 > weeks of training Apple Mail, and then it was pretty reliable in > distinguishing spam from other mail.
> The trick is to make the classification as simple as clicking the > delete button, so that it fits with your regular workflow. I wouldn't > have accepted a distinct classification step.
What I did for a while before "trusting" the Bayesian filter's classification was to classify my saved non-spam mailboxes as "not spam", then move everything that was spam to a "spam" folder and, after a few days, classify that as "spam". That is slightly more than "just click a button", but the extra work is the typing of two commands (sa-learn --spam --dir ../spam && rm ../spam/*).
I could, I guess, have added a "Spam" and a "Ham" button to my mail client and done it that way, but after having spent a whole 2-3 minutes, I decided that the amortised effort probably wasn't worth it. As usual, your mileage may vary.
//Ingvar
--
(defmacro fakelambda (args &body body) `(labels ((me ,args ,@body)) #'me))
(funcall (fakelambda (a b) (if (zerop (length a)) b (format nil "~a~a" (aref a 0) (me b (subseq a 1))))) "Js nte iphce" "utaohrls akr")
Tim Bradshaw <t...@cley.com> writes: > I think the problem is that something has to score the results, and > that requires a human to look at the spam and nonspam, and that > defeats the object.
Before I returned to reading email in Emacs, I used the Cloudmark service embedded in Outlook. Basically, it lets you mark email as spam and sends an MD5 of the email to them. All mail marked by others as spam is moved to the spam folder.
Or, in other words, only a single user has to see each individual spam. On a normal day it would remove about 95% of all my spam, and I'd do the remaining 5% as my part of the effort. I receive 100+ spam per day. I would go through my spam box every week before deleting. Not even once during my years as a user did I find that Cloudmark had been mistaken!
Cloudmark works on the assumption that the body of spam is identical for all users. As long as it is, the approach will work if you have a large enough user community.
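A minimal sketch of that digest-sharing idea, for illustration only. `SharedBlocklist`, `report_spam` and `is_known_spam` are hypothetical names (not Cloudmark's API), and the whitespace and case normalisation is an assumption of this sketch.

```python
import hashlib


def body_digest(body):
    """MD5 hex digest of the body after collapsing whitespace and lowercasing."""
    normalised = " ".join(body.split()).lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


class SharedBlocklist:
    """Hypothetical stand-in for a community service that stores reported digests."""

    def __init__(self):
        self._digests = set()

    def report_spam(self, body):
        # One user marks the message; only its digest is shared.
        self._digests.add(body_digest(body))

    def is_known_spam(self, body):
        # Every other user can then match the same body without reading it.
        return body_digest(body) in self._digests
```

Note that a single personalised word in the body changes the digest, which is exactly the "identical body" assumption described above.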
According to their homepage, sendmail.com has chosen their solution as part of their commercial offering. Unfortunately, there is nothing to buy from sendmail.com for NetBSD, so that's a no-go for me.
Tim Bradshaw <t...@cley.com> writes: > * Ingvar Mattsson wrote:
> > FWIW, I used to "not have a spam problem" (that is, the ratio of spam > > to legitimate email was low enough that the spam was not a > > distraction). In the last 3-4 months, it's gone to spam being the > > *major* part of my mailbox. Addresses ending up there have been in > > existence (and used on Usenet) from 1993, 1996 and somewhere around > > 1999 (and one only since last year).
> Me too. I've struggled with a Bayesian thing but it really doesn't > cope: I think it is failing because there isn't enough good mail to > train it on (I get hardly any `real' mail: probably 1% of my mail is > real), so it essentially classifies everything as spam, which I then > have to wade through. I may try feeding it my whole mailbox as good > to give it some more data to learn from, but that's only a couple of > days worth of spam now, so I'm not sure if it will work.
My problem is that it's pretty easy to get SpamAssassin to recognize spam, but even with years of legit email to work with, I can't get a setup that doesn't mark large amounts of my legit email as spam. I was able to get SA to mark only about 2% of my legit email as spam, and let through about 50% of the spam I get, but it involved filtering for viruses and known-good senders first. So, I was able to get things set up so I only wade through the amount of spam I did a year ago.
I'm guessing that the writers of spam filters mostly correspond with people who write in more-or-less complete sentences. But I don't want to stop writing to my friends who sometimes send me emails like:

  Yo son, CHECK THIS SHIT OUT $$$ <url> fuk dat track is HOT!!
-- /|_ .-----------------------. ,' .\ / | No to Imperialist war | ,--' _,' | Wage class war! | / / `-----------------------' ( -. | | ) | (`-. '--.) `. )----'
There are still spammers who send mail to message-IDs dating back to 1994, and the SGML mailboxes continue to receive lots and lots of junk despite being discontinued in 1996. It is about one third of what my 1987-vintage University of Oslo address receives, but still, I'm quite impressed with the persistence of these drooling idiots.
Actually, this doesn't seem all that surprising, given some of the economics of the spam industry. The upstream suppliers like to sell EMail address lists with the largest number of addresses in them, so they really have absolutely NO incentive to try to cull addresses once they have been added. As far as I know, there aren't any rating groups out there who analyze the degree of usefulness of those address lists :)
-- Thomas A. Russ, USC/Information Sciences Institute
Tim Bradshaw <t...@cley.com> writes: > Me too. I've struggled with a Bayesian thing but it really doesn't > cope: I think it is failing because there isn't enough good mail to > train it on (I get hardly any `real' mail: probably 1% of my mail is
Maybe you could use comp.lang.lisp to train it?
I'm using mew (an emacs mh mail client) and it has a very handy command to train spam (ls) and ham (lh) in conjunction with bogofilter (Bayesian). However, the problem lately is the high ratio of Bayesian poison that I get (which I just file in my spam folder without any further training).
Are there any good filters for detecting the Bayesian-poison nonsense?
Petter -- A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing on usenet and in e-mail?
In article <87ektus2nl....@filestore.home.gustad.com>, Petter Gustad <newsmailco...@gustad.com> wrote:
> Are there any good filters for detecting the Bayesian-poison nonsense?
I haven't actually tried this, but all the Bayes-poison I've gotten is just random words, so I would think it would succumb easily to a histogram test. I don't think you'd have to get very sophisticated. Just looking at the ratio of short words (<= 3 characters) to the total number of words seems like it ought to work. I predict this ratio would be around 0.1 for non-spam, and <0.01 for Bayes-poison spam.
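A rough sketch of the short-word ratio described above. The tokenising regex and the function name are assumptions of this sketch; the 0.1 and 0.01 figures are untested predictions from the post, so the function only returns the ratio and leaves any cutoff to be calibrated on real mail.

```python
import re

# Assumption: a "word" is a run of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def short_word_ratio(text, max_len=3):
    """Fraction of words with at most `max_len` characters (0.0 for empty text)."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    short = sum(1 for w in words if len(w) <= max_len)
    return short / len(words)
```

Measuring this on a saved ham folder and a saved spam folder would show whether the two distributions really separate before any threshold is chosen.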
> My problem is that it's pretty easy to get SpamAssassin to recognize > spam, but even with years of legit email to work with, I can't get a > setup that doesn't mark large amounts of my legit email as spam.
Right, that's my problem exactly. In fact it's not really surprising, since well over 99% of my mail *is* spam (or was in week 3 of 2004, anyway), so it essentially doesn't see any good mail at all. I think that this is what will doom Bayesian things (and possibly any statistical approach): statistically *all* mail will be spam quite soon (and it is already for some people) so any training data they get will be hopelessly skewed. Although someone with a better grasp of statistics than me can probably demonstrate that this is OK in fact.
>>>>> "Erann" == Erann Gat <gNOSPA...@jpl.nasa.gov> writes:
Erann> In article <87ektus2nl....@filestore.home.gustad.com>, Erann> Petter Gustad Erann> <newsmailco...@gustad.com> wrote:
>> Are there any good filters for detecting the Bayesian-poison >> nonsense? >>
Erann> I haven't actually tried this, but all the Bayes-poison Erann> I've gotten is just random words, so I would think it would Erann> succumb easily to a histogram test. I don't think you'd Erann> have to get very sophisticated. Just looking at the ratio Erann> of short words (<= 3 characters) to the total number of Erann> words seems like it ought to work. I predict this ratio Erann> would be around 0.1 for non-spam, and <0.01 for Erann> Bayes-poison spam.
Better yet, just use the filing of good or bad as further training data. I get the occasional escapee with poisoning attempts, but it just gets sent to the spam pile, and trained as spam for next time.
Even with poisoning, I get relatively little delivered to my main mailbox, so the filtering is doing well. (spambayes, if it matters)
Tim Lavoie <tool_...@spamcop.net> writes: > Better yet, just use the filing of good or bad as further training > data. I get the occasional escapee with poisoning attempts, but it > just gets sent to the spam pile, and trained as spam for next time.
> Even with poisoning, I get relatively little delivered to my main > mailbox, so the filtering is doing well. (spambayes, if it matters)
The problem is, you're training it to *recognize* the poison, but not necessarily to distinguish it from less-than-coherent legit email.
| Me too. I've struggled with a Bayesian thing but it really doesn't | cope: I think it is failing because there isn't enough good mail to | train it on (I get hardly any `real' mail: probably 1% of my mail is | real), so it essentially classifies everything as spam, which I then | have to wade through. I may try feeding it my whole mailbox as good | to give it some more data to learn from, but that's only a couple of | days worth of spam now, so I'm not sure if it will work.
I have good experiences with spambayes. When I first set it up, I trained it on about 1000 hams and several thousand spams that I had squirreled away for just such an opportunity. Then I tested it on all the mail I had just trained it on, and discovered about a dozen messages misclassified in each category. Now, my daily haul is roughly 300 spam, 10 ham, and 3 unsure. I hand classify the unsure ones and train on them, otherwise I do no more training. In eight months operation I had very few hams misclassified as spam. But then I am perhaps lucky, living in the academic world. Those who live in the business world and need to distinguish between sales pitches that are not spam and those that are, may have less luck.
-- * Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/> - Debating gives most of us much more psychological satisfaction than thinking does: but it deprives us of whatever chance there is of getting closer to the truth. -- C.P. Snow
* Erann Gat wrote: > I haven't actually tried this, but all the Bayes-poison I've gotten is > just random words, so I would think it would succumb easily to a histogram > test. I don't think you'd have to get very sophisticated. Just looking > at the ratio of short words (<= 3 characters) to the total number of words > seems like it ought to work. I predict this ratio would be around 0.1 for > non-spam, and <0.01 for Bayes-poison spam.
I presume that what will happen is an arms race here: the obvious approach to dealing with this kind of random-word stuff is to look at bigram (or in general n-gram) statistics. I'm fairly sure that the bigram stats of the kind of bayesian-poison stuff I see is nothing like English. So something that does bigrams should easily be able to distinguish. But unfortunately it's just as easy to write things which *generate* text using bigram statistics from a bunch of input data, and of course that output will look exactly like English to a bigram-stats filter. Anything longer than bigrams starts needing lots of training data I think.
(Or indeed: Input data I think. Gram statistics; Just as easy to a bigram stats filter. The bigram stats filter. Obvious approach to a bigram statistics from a bigram statistics from a bigram statistics; I'm fairly sure that does bigrams starts needing lots of the kind of bayesian poison stuff I see is nothing like English.)
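The arms race can be sketched in a few lines, under simplifying assumptions: words are split on whitespace, the "filter" just measures how many adjacent word pairs were seen in training, and all function names are made up for this example. Text generated from the same bigram table passes that filter by construction.

```python
import random
from collections import defaultdict


def bigrams(words):
    """Adjacent word pairs, in order."""
    return list(zip(words, words[1:]))


def train(corpus):
    """Map each word to the list of words that followed it in the corpus."""
    followers = defaultdict(list)
    for a, b in bigrams(corpus.lower().split()):
        followers[a].append(b)
    return followers


def seen_fraction(followers, text):
    """Fraction of adjacent word pairs in `text` that occurred in training."""
    pairs = bigrams(text.lower().split())
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if b in followers.get(a, ()))
    return hits / len(pairs)


def generate(followers, start, length, rng):
    """Random walk over the bigram table, stopping early at a dead end."""
    out = [start]
    while len(out) < length and followers.get(out[-1]):
        out.append(rng.choice(followers[out[-1]]))
    return " ".join(out)
```

Shuffled real words score near zero on `seen_fraction`, while output of `generate` scores 1.0, which is the weakness described above.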
Tim Bradshaw <t...@cley.com> writes: > * Erann Gat wrote:
> > I haven't actually tried this, but all the Bayes-poison I've gotten is > > just random words, so I would think it would succumb easily to a histogram > > test. I don't think you'd have to get very sophisticated. Just looking > > at the ratio of short words (<= 3 characters) to the total number of words > > seems like it ought to work. I predict this ratio would be around 0.1 for > > non-spam, and <0.01 for Bayes-poison spam.
> I presume that what will happen is an arms race here: the obvious > approach to dealing with this kind of random-word stuff is to look at > bigram (or in general n-gram) statistics. I'm fairly sure that the > bigram stats of the kind of bayesian-poison stuff I see is nothing > like English. So something that does bigrams should easily be able to > distinguish. But unfortunately it's just as easy to write things > which *generate* text using bigram statistics from a bunch of input > data, and of course that output will look exactly like English to a > bigram-stats filter. Anything longer than bigrams starts needing lots > of training data I think.
> (Or indeed: Input data I think. Gram statistics; Just as easy to a > bigram stats filter. The bigram stats filter. Obvious approach to a > bigram statistics from a bigram statistics from a bigram statistics; > I'm fairly sure that does bigrams starts needing lots of the kind of > bayesian poison stuff I see is nothing like English.)
I've done a test with bigrams and trigrams: the gain to bigram was sizeable, but from bigrams to trigrams, there was not much difference on the data set I used (with mailboxes from 1000 to 4000 spam or good messages).
-- __Pascal_Bourguignon__ http://www.informatimago.com/ There is no worse tyranny than to force a man to pay for what he doesn't want merely because you think it would be good for him.--Robert Heinlein http://www.theadvocates.org/
* Tim Bradshaw | (Or indeed: Input data I think. Gram statistics; Just as easy to a | bigram stats filter. The bigram stats filter. Obvious approach to a | bigram statistics from a bigram statistics from a bigram statistics; | I'm fairly sure that does bigrams starts needing lots of the kind of | bayesian poison stuff I see is nothing like English.)
The natural extension of this method is to use the spell checker on incoming mail.
-- Erik Naggum | Oslo, Norway
Act from reason, and failure makes you rethink and study harder. Act from faith, and failure makes you blame someone and push harder.
Tim Bradshaw <t...@cley.com> writes: > * Thomas F Burdick wrote:
> > My problem is that it's pretty easy to get SpamAssassin to recognize > > spam, but even with years of legit email to work with, I can't get a > > setup that doesn't mark large amounts of my legit email as spam.
> Right, that's my problem exactly. In fact it's not really surprising, > since well over 99% of my mail *is* spam (or was in week 3 of 2004, > anyway), so it essentially doesn't see any good mail at all. I think > that this is what will doom Bayesian things (and possibly any > statistical approach): statistically *all* mail will be spam quite > soon (and it is already for some people) so any training data they get > will be hopelessly skewed. Although someone with a better grasp of > statistics than me can probably demonstrate that this is OK in fact.
The concept you might be looking for is "extreme value statistics", which is the statistics of rare events. It is used by insurance companies to model events like earthquakes and hurricanes. Extreme value statistics is often hard to do, because you need to model the tails of distributions accurately. I'm no expert here, so I can't say more. (But Google has lots of references.)
So far statistical methods seem relatively successful approaches to model human language (compared to the alternatives). The problem with Bayesian filters is not that they are using Bayes' rule but that, currently, their model of human language is crude (individual words are assumed statistically independent).
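To make the independence assumption concrete, here is a small word-level naive Bayes sketch. It is not bogofilter's or SpamAssassin's actual algorithm; the class name, the whitespace tokenising and the add-one smoothing are assumptions chosen for brevity.

```python
import math
from collections import Counter


class NaiveBayes:
    """Word-level naive Bayes: each word adds to the log-odds independently."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_log_odds(self, text):
        """log P(spam | words) - log P(ham | words); positive means spam-like."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        totals = {k: sum(c.values()) for k, c in self.counts.items()}
        # Prior from message counts, add-one so an untrained class is not -inf.
        odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in text.lower().split():
            # Add-one smoothing, with one extra slot reserved for unseen words.
            p_spam = (self.counts["spam"][word] + 1) / (totals["spam"] + vocab + 1)
            p_ham = (self.counts["ham"][word] + 1) / (totals["ham"] + vocab + 1)
            odds += math.log(p_spam / p_ham)
        return odds
```

Because the score is a plain sum over words, word order has no effect at all, which is the crudeness being pointed out.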
BTW: For me, the current language model works fine: my mailbox receives approx. 1000 emails a month, 2/3 of which are spam. Bogofilter does a very nice job of filtering (much better than SpamAssassin's set of rules). OTOH, I don't trust it enough to put my e-mail address on Usenet. ;)
Petter Gustad <newsmailco...@gustad.com> writes: > Tim Bradshaw <t...@cley.com> writes:
> > Me too. I've struggled with a Bayesian thing but it really doesn't > > cope: I think it is failing because there isn't enough good mail to > > train it on (I get hardly any `real' mail: probably 1% of my mail is
> Maybe you could use comp.lang.lisp to train it?
> I'm using mew (an emacs mh mail client) and it has a very handy > command to train spam (ls) and ham (lh) in conjunction with bogofilter > (Bayesian). However, the problem lately is the high ratio of Bayesian > poison that I get (which I just file in my spam folder without any > further training).
> Are there any good filters for detecting the Bayesian-poison nonsense?
Grave mismatch between the text/plain and the text/html non-marked-up text? That takes care of stuff with "You have received a mail in HTML" too (if people can't write plain text emails, I don't want them).
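A sketch of such a mismatch check using Python's standard `email` and `html.parser` modules. The word-set overlap measure, the function names, and the choice to return 1.0 when a part is missing are assumptions of this sketch, not part of any existing filter.

```python
from email import message_from_string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect character data, ignoring tags and comments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def word_overlap(a, b):
    """Jaccard overlap of the two word sets, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def part_overlap(raw_message):
    """Overlap between the first text/plain part and the de-tagged first text/html part.

    Returns 1.0 when either part is missing, since there is nothing to compare.
    """
    msg = message_from_string(raw_message)
    plain = html = None
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "text/plain" and plain is None:
            plain = part.get_payload(decode=True).decode("utf-8", "replace")
        elif ctype == "text/html" and html is None:
            html = part.get_payload(decode=True).decode("utf-8", "replace")
    if plain is None or html is None:
        return 1.0
    return word_overlap(plain, html_to_text(html))
```

A low value would indicate a "grave mismatch"; where to put the cutoff is left open here.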
* Pascal Bourguignon wrote: > I've done a test with bigrams and trigrams: the gain to bigram was > sizeable, but from bigrams to trigrams, there was not much difference > on the data set I used (with mailboxes from 1000 to 4000 spam or good > messages).
This is very likely because you don't have enough data to get reasonable trigram stats: you really need a lot.
* Erik Naggum wrote: > The natural extension of this method is to use the spell checker on > incoming mail.
I think I may be confused by what people mean by `Bayesian poison'. Is it all the stuff which has non-words in it, like v1agra &c? In that case I think that something which had a decent sized dictionary and looked for more than n% known words would probably be good.
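A minimal sketch of that dictionary check. The tokenising regex and function name are assumptions, and the dictionary is passed in as a plain set; which word list to use and what n% to require are left open.

```python
import re

# Assumption: tokens are runs of letters, digits and apostrophes, so that
# "v1agra" stays one token and fails the dictionary lookup.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def known_word_fraction(text, dictionary):
    """Fraction of tokens found in `dictionary` (0.0 for empty text)."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in dictionary)
    return known / len(tokens)
```

As the next paragraph notes, this does nothing against poison made of correctly spelled random words.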
The stuff I was worrying about was things which have lots of random, English, words in, possibly with correct single-word stats, so anything that uses a completely naive single-word model of language will assume it's English, but the bigram stats will be completely mutant. A spelling checker doesn't help with this, I think. I also think an ngram thing can be fooled because it's so easy to generate.
Another approach might be to use a PoS tagger, and then look at the nouns to see what *they* look like (or maybe other parts of speech).
But even here, I don't know. I have seen (or possibly dreamed) stuff which has great chunks of out-of-copyright novels in, but is constructed such that if you look at it with whatever tool it's aimed at, you don't see that (sometimes because it's comments, but sometimes because it's white text on a white background). So a tool needs to be able to work out what IE would do with it, which is a hideous problem. (Actually, it's an easy problem: if it's not plain text, it's spam. So spam has essentially destroyed any kind of rich content in email).
Tim Bradshaw <t...@cley.com> writes: > I think I may be confused by what people mean by `Bayesian poison'. > Is it all the stuff which has non-words in it, like v1agra &c?
Hmm? I thought it was all the stuff that had real words in it?
Here's a cut and paste from a recently received spam:
And my impression was that this isn't designed to defeat Bayesian spam detectors, because the above words in any individual's corpus of ham will feature very rarely if at all.
A reasonable assumption might be that one or two of these words have appeared in a small number of ham messages. They will therefore contribute a small amount to the haminess of the message. However, the payload will generally have many strong indications of spam, and consequently, at least here, the message is still filed as spam.
This filing as spam, in addition, retrains the Bayesian filter, such that e.g. "hierarchy" becomes a slightly less strong indication of spam. So be it. If "hierarchy" is a commonly-used word in your ham (as it has been for me, as I did some work on the cosmological hierarchy problem -- which incidentally means that "heirarchy" is also a ham word for me... :-) then this slight retraining has no effect on truly ham messages, which are still classified as ham; nor does it have any effect on the next set of random words in spam, which is unlikely to use "hierarchy" again. So all in all, these random words don't seem designed to defeat adaptive Bayesian filters.
What they do defeat, of course, is a SpamAssassin rule such as one matching a MIME message with no text/plain component, without triggering a rule detecting obvious spam words. If I were attempting to defeat Bayesian filters, more common words would seem more likely to act as poison.
Christophe -- http://www-jcsu.jesus.cam.ac.uk/~csr21/ +44 1223 510 299/+44 7729 383 757 (set-pprint-dispatch 'number (lambda (s o) (declare (special b)) (format s b))) (defvar b "~&Just another Lisp hacker~%") (pprint #36rJesusCollegeCambridge)
Strangely enough, none of these words appear in my normal email. Perhaps if I was a nuclear physicist, "decay" would appear sometimes...
-- __Pascal_Bourguignon__ http://www.informatimago.com/ There is no worse tyranny than to force a man to pay for what he doesn't want merely because you think it would be good for him.--Robert Heinlein http://www.theadvocates.org/
In fact, on my filter database, those words would end up as spamwords, too, as they are quite randomly selected while my hamwords are of course adapted to the mail traffic *I* individually receive (a great part German words, but in both English and German, biased through my interests).
And even if every piece of spam contains different such words, they are at least not strongly "ham", and don't influence the final decision, as that is drawn from the most significant words.
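A sketch of why near-neutral filler words drop out when only the most significant words are used. It assumes per-word spam probabilities are already available in a dict; the default of 0.4 for unseen words, the limit of 15 words, the clamping range and the function names are all assumptions of this example rather than details of any filter mentioned in the thread.

```python
import math


def most_significant(probs, words, n=15, unknown=0.4):
    """Spam probabilities of the n distinct words farthest from the neutral 0.5."""
    scored = [probs.get(w, unknown) for w in set(words)]
    scored.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return scored[:n]


def combined_spam_probability(ps):
    """Combine independent per-word probabilities into one score in (0, 1)."""
    # Clamp so a single 0.0 or 1.0 cannot force a division by zero.
    ps = [min(0.99, max(0.01, p)) for p in ps]
    spam = math.prod(ps)
    ham = math.prod(1.0 - p for p in ps)
    return spam / (spam + ham)
```

With a small n, random words that sit near 0.5 are never selected, so they cannot move the result either way.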
Tim Bradshaw <t...@cley.com> writes: > I presume that what will happen is an arms race here: the obvious > approach to dealing with this kind of random-word stuff is to look at > bigram (or in general n-gram) statistics.
It depends on the hack they are using.
Some spam I get has random character strings in it like this:
Bad Credit is OK Gold Visa Card rjkarcmrosfn f
others have pseudowords:
astrologum Palatioque big chance
some have random words:
Authors, you decide ........picosecond why wait? ................... dakar
some seem to take a sample of canned text:
Rolex-Italian crafted from $65-$65-$140 Free Ship<!--extravagance. And when did you arrive here? inquired she. -->
and some seem to sample text from other sources:
Bechtel nod to L&T raises Indian hopes on Iraq projects
I have a regular expression that matches non-English trigrams. It finds the random character strings without a problem. Pseudowords tend to have non-English trigrams in them (for instance the trigram "ioq" is not common in English). The other stuff gets more difficult.
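The regular expression itself is not shown in the post, so the following is only a stand-in for the same idea: it swaps the regex for a set of character trigrams collected from a reference word list and flags words containing a trigram outside that set. The function names and the tiny reference list in the usage note are made up for illustration.

```python
def char_trigrams(word):
    """All three-letter windows of a word, lowercased."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def build_known_trigrams(reference_words):
    """Union of the trigrams seen in a reference list of ordinary words."""
    known = set()
    for w in reference_words:
        known |= char_trigrams(w)
    return known


def unfamiliar_words(text, known):
    """Alphabetic words containing at least one trigram not in `known`.

    Words shorter than three letters have no trigrams and are never flagged.
    """
    return [w for w in text.split() if w.isalpha() and char_trigrams(w) - known]
```

For example, with a reference list containing "palace", "ratio" and "question", the word "Palatioque" is flagged because "ioq" never occurs in the reference words. A real reference list would need to be far larger to avoid flagging ordinary vocabulary.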
The random words should be relatively easy statistically. While I might actually talk about picoseconds, the bulk of the words in the English language are not ones I commonly use. (How often do you write about hippodromes, speakeasies, and peristalses?)
Tim Bradshaw <t...@cley.com> writes: > Me too. I've struggled with a Bayesian thing but it really doesn't > cope: I think it is failing because there isn't enough good mail to > train it on (I get hardly any `real' mail: probably 1% of my mail is > real), so it essentially classifies everything as spam, which I then > have to wade through. I may try feeding it my whole mailbox as good > to give it some more data to learn from, but that's only a couple of > days worth of spam now, so I'm not sure if it will work.
A pre-filter before the Bayesian filter is helpful. Bayesian filters tend to `latch on' to things that are easy to find statistically to the detriment of those things that are more difficult. There is no need to train the filter to remove email from the .BIZ domain, or email that claims to be a reply, but doesn't have a `References' header.
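The two example rules above could be sketched as a pre-filter like this, using Python's standard `email` package. The function name and the returned reason strings are invented for the example; only the two rules themselves come from the text.

```python
from email import message_from_string
from email.utils import parseaddr


def prefilter_reason(raw_message):
    """Return a short reason string if a hard rule matches, else None."""
    msg = message_from_string(raw_message)

    # Rule 1: sender address in the .biz top-level domain.
    _, addr = parseaddr(str(msg.get("From", "")))
    domain = addr.rpartition("@")[2].lower()
    if domain.endswith(".biz"):
        return "sender in .biz domain"

    # Rule 2: subject claims to be a reply, but there is no References header.
    subject = str(msg.get("Subject", "")).strip().lower()
    if subject.startswith("re:") and not msg.get("References"):
        return "claims to be a reply but has no References header"

    return None
```

Only messages for which this returns None would be handed on to the Bayesian filter, so it never has to spend training data on these cases.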
In order to get the best results you need to have a model of spam and ham. It would be worthwhile to determine how the various popular spam engines work so as to detect the engine itself rather than the message within the spam. For instance, one spam engine generates random pseudohtml: <oeaun><snauthsbm>
Another inserts text in the middle of words: V<!-- oblong -->ia<!-- interest -->gra
You don't need to statistically match these, just detect them.
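Both tricks can be detected with plain pattern matching, roughly as below. The regular expressions, the label strings and the `KNOWN_TAGS` set (a deliberately small, illustrative subset of HTML element names) are assumptions of this sketch.

```python
import re

# An HTML comment with a word character touching it on both sides,
# e.g. V<!-- oblong -->ia
COMMENT_IN_WORD = re.compile(r"\w<!--.*?-->\w", re.DOTALL)

# Name of each opening tag; closing tags and comments do not match.
TAG_NAME = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)[^<>]*>")

# Illustrative subset only; a real check would need the full element list.
KNOWN_TAGS = {
    "html", "head", "title", "body", "p", "br", "a", "b", "i", "u", "font",
    "img", "table", "tr", "td", "th", "div", "span", "center", "hr",
    "ul", "ol", "li", "h1", "h2", "h3", "meta", "style",
}


def engine_signatures(html):
    """Return labels for the spam-engine tricks found in `html`."""
    hits = []
    if COMMENT_IN_WORD.search(html):
        hits.append("comment-inside-word")
    unknown = [t for t in TAG_NAME.findall(html) if t.lower() not in KNOWN_TAGS]
    if unknown:
        hits.append("unknown-tags")
    return hits
```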
> statistical methods seem relatively successful approaches to > model human language (compared to the alternatives). The problem with > Bayesian filters is not that they are using Bayes' rule but that, > currently, their model of human language is crude (individual words > are assumed statistically independent).
> BTW: For me, the current language model works fine: my mailbox > receives approx. 1000 emails a month, 2/3 of which are spam. Bogofilter > does a very nice job of filtering (much better than SpamAssassin's > set of rules). OTOH, I don't trust it enough to put my e-mail address > on Usenet. ;)
These days I run a few simple checks on my mail -- basically whitelists for known senders -- and then hand it over to ASSP's bayesian filter.
ASSP, and several other new-generation filters, build a spamdb of word pairs rather than individual words. It makes the database bigger, but hard drive space is cheap.
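As an illustration of word-pair tokens (not ASSP's actual tokenizer, which is not described here), a message can be turned into overlapping pairs like this; the function name and whitespace splitting are assumptions.

```python
def word_pair_tokens(text):
    """Overlapping pairs of adjacent lowercased words, joined by a space."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]
```

Each pair then takes the place of a single word as the unit the Bayesian database counts, which is why the database grows: there are many more distinct pairs than distinct words.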
The only thing that some people might dislike about it is that if someone uses a free service that attaches a lot of spam to every message that comes through (like Yahoo) it will treat their messages as spam. I am happy with this behavior and do not correct it.
I use ASSP. I've had the same email address for ten years. I post to usenet using it. 'nuff said?