New Paul Graham Article


sv0f

Aug 16, 2002, 2:30:01 PM

On a statistical approach to filtering spam, with Lisp code,
here:

http://www.paulgraham.com/spam.html

Discussion of this article is currently happening on Slashdot,
for the interested.

Christopher Browne

Aug 16, 2002, 9:25:40 PM

In an attempt to throw the authorities off his trail, no...@vanderbilt.edu (sv0f) transmitted:

And if you want to work with an existing package that has been mature
for several years now, you might look at the URL below for "Ifile." I
helped tune it to become pretty fast.

And I have to disagree somewhat with Graham's article; Naive Bayesian
filtering _doesn't_ provide _quite_ as good results as he implies.
Having both "sex" and "sexy" in a message does _not_ guarantee at P >
0.99 that messages will get tossed into the "spam" category.

My statistics for those words in my corpus are thus:

sexy 4525 424:28 426:2 449:1 456:1

sex 62535 160:16 169:6 171:5 173:2 184:1 190:1 194:2 211:1 215:4 218:1
221:15 224:3 226:1 234:2 237:11 238:1 239:2 241:1 244:1 247:1 249:11
251:1 264:2 273:2 278:2 285:7 289:2 295:1 306:2 321:5 322:2 323:4
324:9 327:14 332:2 334:2 343:15 346:2 347:1 350:5 352:1 354:2 362:4
366:6 368:10 369:3 370:1 397:20 411:2 413:3 414:6 415:15 416:16 418:3
421:17 423:1 424:338 425:11 426:23 432:2 433:2 439:3 442:1 459:2 465:3

The "424:28" indicates that the word "sexy" occurred 28 times in
folder #424, which happens to be the "Spam/Phonesex" folder. #426 is
Spam/Snakeoil, #449 is X/Advocacy, with an instance of a quote about
people being "mesmerized by sexy glitz which distracts them from the
work at hand." #456 pointed to a .signature with the word "sexy."
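
As a rough illustration of the notation described above, this Python sketch splits one such line into the word and its per-folder counts. The meaning of the second field (4525 for "sexy") is not explained in the post, so the sketch only carries it along. The function name `parse_stat_line` is invented for illustration and is not part of Ifile.

```python
def parse_stat_line(line):
    """Split 'word N folder:count ...' into (word, N, {folder: count})."""
    fields = line.split()
    word, second = fields[0], int(fields[1])
    counts = {}
    for pair in fields[2:]:
        folder, count = pair.split(":")
        counts[int(folder)] = int(count)
    return word, second, counts

word, second, counts = parse_stat_line("sexy 4525 424:28 426:2 449:1 456:1")
total = sum(counts.values())
# Share of the recorded occurrences of "sexy" that fall in folder #424
share_424 = counts[424] / total
```

For this line, 28 of the 32 recorded occurrences sit in folder #424.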

Frankly, the word "sexy" is a very _useful_ one. (And looking at the
stats here has caused me to modify a couple email messages in my
archives, which will strengthen the result :-).)

Unfortunately, it's not only found in the "Phonesex" folder.
Instances are found here and there everywhere. And there are other
words that are very common both in "evil spam" and in everyday
conversation. Integrating the whole set of statistics together
requires adding up statistics for _all_ the words found in a message,
not just the words "sex" and "sexy."
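
To make "adding up statistics for all the words" concrete, here is a minimal multi-folder naive Bayes sketch in Python. It is not Ifile's actual code: the class name, the add-one smoothing and the tiny training texts are all assumptions made for illustration. Only the folder names come from this thread.

```python
import math
from collections import Counter, defaultdict

class FolderClassifier:
    """Multinomial naive Bayes over folders, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.message_counts = Counter()          # folder -> messages seen
        self.vocabulary = set()

    def train(self, folder, words):
        self.word_counts[folder].update(words)
        self.message_counts[folder] += 1
        self.vocabulary.update(words)

    def scores(self, words):
        """Return {folder: log score}, summing evidence from every word."""
        total_messages = sum(self.message_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for folder, counts in self.word_counts.items():
            folder_total = sum(counts.values())
            score = math.log(self.message_counts[folder] / total_messages)
            for w in words:
                score += math.log((counts[w] + 1) / (folder_total + vocab_size))
            result[folder] = score
        return result

    def classify(self, words):
        s = self.scores(words)
        return max(s, key=s.get)

clf = FolderClassifier()
clf.train("Spam/Phonesex", "hot sexy girls live call now sex".split())
clf.train("Spam/Pyramid", "urgent transfer million dollars bank account".split())
clf.train("Apps/Ifile", "ifile folder statistics sex sexy words corpus ifile".split())

message = "ifile statistics for the words sex and sexy in my corpus".split()
best = clf.classify(message)
```

Although the message contains both "sex" and "sexy", the remaining words outweigh them in this toy corpus, and it lands in "Apps/Ifile".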

My finding is that it is _nowhere_ near sufficient to have two
populations, "spam" versus "not spam."

If you muddle together the Nigerian Pyramid schemes with the "Penis
enhancement" ads along with the offers of new credit cards as well as
the latest sites where you can talk to "hot, horny girls LIVE!", the
statistics don't work out nearly so well.

It's hard to tell, on the face of it, why Nigerian scams _should_ be
considered textually similar to phone sex ads, and in practice, the
result of throwing them all together is poor.

I have my spam split into categories so that filtering is _even more
discriminatory_:

Credit
Foreign
Gambling
Investigators
Newsletters
Phonesex
Pyramid
Snakeoil
Viruses

There are a few things left to improve about Ifile, and I'd like to
redo it in some language fundamentally less painful to work with than
C. The project I periodically consider is to redo the filtering
software in Lisp. Unfortunately, I wind up running into _tremendous_
bottlenecks each time I do so. Some combination of my skills and the
tools at hand proves not quite adequate. Maybe next time...
--
(concatenate 'string "chris" "@cbbrowne.com")
http://cbbrowne.com/info/mail.html#ifile
Out of my mind. Back in five minutes.

Erik Naggum

Aug 16, 2002, 11:51:05 PM

* no...@vanderbilt.edu (sv0f)

| On a statistical approach to filtering spam, with Lisp code, here:

Spam has to be dealt with at the transport level. The ability of strangers
to send you mail must be curtailed. Several large sites offer a system to
reject all mail from unknown correspondents, temporarily or permanently, and
wait for the reader of the log to accept incoming mail from addresses that
look familiar. Another option is to accept delivery but return transport-
like error messages if the user does not want the message. Yet another
option is to see if the smtp client is set up to accept mail for the domain
that it tries to deliver mail from. Yet another option is to temporarily
reject all mail from unknown sources and utilize the fact that spammers have
no resources to queue messages for later delivery. And then you can always
implement a scheme that returns a temporary rejection, but independently sends
a mail to the originator asking him to confirm that he is human and that he
accepts the condition that unsolicited commercial e-mail carries a fee
that /will/ be collected. Failure to accept the conditions will cause the
temporary rejection never to be lifted, thus using up queue space in the
offending server, which any sysadmin will notice and take care of even if
they do not bother to fix their system configuration to avoid relaying spam.
Should the conditions be accepted, the message is allowed through.
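
One of the options above, temporarily rejecting mail from unknown sources and accepting it only when the sending server retries, can be sketched as follows. This is a toy model under stated assumptions, not a description of any real mail server: the class name, the five-minute delay and the choice to key on (client address, sender, recipient) are all invented for illustration. The reply codes follow the SMTP convention that 4xx means "try again later".

```python
RETRY_DELAY = 300  # seconds an unknown sender must wait before retrying (assumed value)

class Greylist:
    def __init__(self, retry_delay=RETRY_DELAY):
        self.retry_delay = retry_delay
        self.first_seen = {}   # (client_ip, sender, recipient) -> time of first attempt
        self.known = set()     # triples that have already retried successfully

    def check(self, client_ip, sender, recipient, now):
        """Return an SMTP-style reply code: 250 to accept, 451 to defer."""
        key = (client_ip, sender, recipient)
        if key in self.known:
            return 250
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.retry_delay:
            # A server that queued the message and came back is remembered
            self.known.add(key)
            return 250
        return 451

g = Greylist()
first_try = g.check("192.0.2.1", "a@example.com", "b@example.org", now=0)
too_soon = g.check("192.0.2.1", "a@example.com", "b@example.org", now=60)
retry = g.check("192.0.2.1", "a@example.com", "b@example.org", now=600)
later = g.check("192.0.2.1", "a@example.com", "b@example.org", now=601)
```

A sender that never queues and retries never gets past the 451 reply, which is the property the scheme relies on.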

If you allow the message to be delivered and waste CPU or brain time, the
spammers have won a small victory. That is just wrong. Spammers must die.

--
Erik Naggum, Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.

JB

Aug 17, 2002, 4:03:03 AM

Erik Naggum wrote:

> If you allow the message to be delivered and waste CPU
> or brain time, the
> spammers have won a small victory. That is just wrong.
> Spammers must die.
>

The countermeasure you mention in your message should be
taken by the mail service provider. Otherwise I should have
to implement a mail client.

In my case the following happened: Immediately after I
started posting to newsgroups, I started getting mails in
which I was offered help with my debts or I was given
advice as to how to make certain parts of my body larger.

I did the following:

(1) I stopped appending a valid email address to my mails
(2) I set up several mail accounts. All but one contain my
initials in some way and there I sometimes still get spam.
But one account is well hidden and only my friends know it.
I never got spam there.

I think that first the users should agree upon spam being
evil. (There is no such agreement yet.) Then there should
be a law against spam. And then police action could be
taken.

--
Janos Blazi



c hore

Aug 17, 2002, 4:58:38 AM

> On a statistical approach to filtering spam, with Lisp code,
> here:
> http://www.paulgraham.com/spam.html

Most of the spam I receive seems to be images, presumably
to bypass text-based filters. I suppose you would have to
run character recognition first on an image before any
text filter, Bayesian or otherwise, could be applied?

AFS97209

Aug 17, 2002, 5:20:57 AM

How effective is it in filtering out requests from African governments
to launder money?

Herb Martin

Aug 17, 2002, 7:01:45 AM

> How effective is it in filtering out requsts from African govenments
> to launder money?

Apparently very effective -- Graham discusses that
specifically.

But the key is that it is TUNED to the particular user
by running a pre-processor through both "good mail"
and "spam mail" databases.
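
A minimal sketch of what such per-user tuning might look like, modelled on the per-word formula in Graham's article: good-mail counts are doubled, each probability is clamped to the range 0.01 to 0.99, and words seen fewer than five times are ignored. The function name, the two tiny count tables and the corpus sizes below are invented for illustration.

```python
def word_spam_probability(word, good, bad, ngood, nbad):
    """Estimate P(spam | word) from one user's own good and spam corpora.

    good/bad map words to occurrence counts; ngood/nbad are message counts.
    Returns None when the word is too rare to say anything about.
    """
    g = 2 * good.get(word, 0)   # doubling good counts biases against false positives
    b = bad.get(word, 0)
    if g + b < 5:
        return None
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))

# Hypothetical counts for one user
good = {"lisp": 40, "sex": 1}
bad = {"sex": 60, "click": 30}
ngood, nbad = 100, 100

p_sex = word_spam_probability("sex", good, bad, ngood, nbad)
p_lisp = word_spam_probability("lisp", good, bad, ngood, nbad)
p_click = word_spam_probability("click", good, bad, ngood, nbad)
p_rare = word_spam_probability("zymurgy", good, bad, ngood, nbad)
```

Because the tables come from one person's mail, a word such as "lisp" can count strongly toward good mail for one user and mean nothing for another.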

The article is worth a quick read.

--
Herb Martin
Try ADDS for great Weather too:
http://adds.aviationweather.noaa.gov/projects/adds

"AFS97209" <afs9...@yahoo.com> wrote in message
news:6dfa3582.02081...@posting.google.com...

Herb Martin

Aug 17, 2002, 7:33:19 AM

The article is worth a quick read.

There is also a FAQ listed at the bottom.

> How effective is it in filtering out requsts from African govenments
> to launder money?

Apparently very effective -- Graham discusses that
specifically.

But the key is that it is TUNED to the particular user
by running a pre-processor through both "good mail"
and "spam mail" databases.

From the FAQ (someone in this thread asked about
graphics):

<quote from faq>
What if spammers sent their messages as images?

Such an email would include a lot of damning content,
actually. The headers, to start with, would be as bad
as ever. And remember that we scan all the html as
well as the text. Within the message body there would
probably be a link as well as the image, both containing
urls, which would probably score high. "Href" and "img"
themselves both have spam probabilities approaching
pornographic words.

<end quote from faq>


Herb Martin
Try ADDS for great Weather too:
http://adds.aviationweather.noaa.gov/projects/adds

--
Herb Martin, PP-SEL
(...and aerobatic student)

"AFS97209" <afs9...@yahoo.com> wrote in message
news:6dfa3582.02081...@posting.google.com...

Herb Martin

Aug 17, 2002, 7:40:49 AM

> And if you want to work with an existing package that has been mature
> for several years now, you might look at the URL below for "Ifile." I
> helped tune it to become pretty fast.

IFile's documentation and download page is
included at the end of Graham's article.

http://www.ai.mit.edu/~jrennie/ifile/

> And I have to disagree somewhat with Graham's article; Naive Bayesian
> filtering _doesn't_ provide _quite_ as good results as he implies.
> Having both "sex" and "sexy" in a message does _not_ guarantee at P >
> 0.99 that messages will get tossed into the "spam" category.

I am not certain of your 'naive' filtering usage with
the example of only "included" words. IFile's doc
page describes its algorithm as "naive Bayesian
filtering" as well.

Graham is using the words included in "good mail"
to counter this, as IFile seems to do.

Herb Martin
Try ADDS for great Weather too:
http://adds.aviationweather.noaa.gov/projects/adds

xah

Aug 17, 2002, 8:23:23 AM

There are two lispy big wigs, namely Paul Graham and Erik Naggum, who think
their hotshot mouthing on spamming is something of value.

Their outpouring, is not unlike that of damming of drivel flood.

In the treatment of flood, there is the brute force of building of a dam.
This is the no-nonsense solution of brutes the likes of technology geeks.
Among them are the brute elite the likes of Graham'n'Naggum, who speak on
fine engineering far ahead of fellow brutes.

The phenomenon of spam is a human-social phenomenon. Spammers spam because
it is effective. Consumers' mouths say no but their actions say yes,
because the vast majority of them are unthinking and happy-go-lucky brutes.
These brutes decline spam, but when they happen upon a spam that pleases
them, such as a great porno site advertisement or something else that catches
their personal interest, they will follow the advertisement. For spammers,
spam is effective if there is one response out of one thousand. Similarly,
brutes will respond to one out of one thousand hateful spams. The prospect
of spamming is thus kept alive by the populous brutes, everywhere, despite
the tumultuous noises they make.

As a human community or society, the treatment of spam is up to us, not
fucking technology. (fuck Erik Naggum. Fuck you. You got it?) The gist is
"what do we want?". (Note: not "what technology geeks want" or "what fucking
unix morons want".) If we do not want spam, there are two ways to get rid of
it: Thru law, and thru inaction. The former is a well-known proposal in the
process. The latter is possible only if humans are not lazy unthinking
beer-drinking brutes, which is never going to happen.

--
(gratuitous poem)

i will be building a dam, when there is a pussy flood.
for i am a collector of nature, and a lover of its treasure.
i will use my savings, to quench those thirsty, and lubricate those angry,
for i have suffered and suffered; the pain and anguish.
-- Xah Lee

I dedicate this poem to Erik Naggum.

PS i request that anyone who has read this far and finds it worthwhile send me
an email with the line "Xah, you are beautiful.". Thanks in advance.

Xah
x...@xahlee.org
http://xahlee.org/PageTwo_dir/more.html


in article 32385450...@naggum.no, Erik Naggum at er...@naggum.no wrote
on 8/16/02 8:51 PM:

Christopher Browne

Aug 17, 2002, 9:32:12 AM

No, it wouldn't be necessary.

If you have a population of messages that consist just of images,
that's going to bias the vocabulary statistics since there will be
lots of words like "multipart" and "alternative" and "jpeg", and very
few of the "legitimate" words that people use when they send you real
mail.
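
A small sketch of why no character recognition is needed: if the whole raw message is tokenized, headers and MIME structure included, an image-only message still yields plenty of words. The token rule below (letters, digits, dashes, apostrophes and dollar signs) and the sample message are assumptions chosen for illustration, not the behaviour of any particular filter.

```python
import re

# A token is a run of letters, digits, dollar signs, apostrophes or dashes
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(raw_message):
    """Lower-cased tokens from the entire raw message, headers included."""
    return [t.lower() for t in TOKEN.findall(raw_message)]

raw = (
    "Content-Type: multipart/alternative; boundary=XYZ\n"
    "\n"
    "--XYZ\n"
    "Content-Type: image/jpeg; name=offer.jpg\n"
)
tokens = tokenize(raw)
```

Tokens such as "multipart", "alternative" and "jpeg" then take part in the statistics like any other word.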

Remember, if this is being used well, you're not merely classifying
between "spam" and "not spam;" you're classifying into a multiplicity
of _legitimate_ categories, such as:

-> Mail from family members
-> Mail from this friend
-> Mail from that friend
-> Mail from the other friend
-> Email from "technical associates," by person
-> Email from mailing lists, arranged _by mailing list_
-> And so forth, for legitimate categories...

combined, preferably, with "spam" that gets classified so that you can
get finer discrimination

-> Pyramid scams
-> Credit card offers
-> Breast/Penis enhancements, Viagra ads, weight loss, stop smoking
plans, ...
-> Computer Viruses
and such.

The spam _isn't_ likely to have similar vocabulary to the email you
get from legitimate sources.

If something with totally new characteristics comes along, it may get
misfiled, at which point you move it to a more appropriate folder
(perhaps even a new folder), and it becomes part of the new corpus,
directing future similar spam to the right place.
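
The refiling step described above amounts to moving a message's word counts from one folder's statistics to another's. A minimal sketch, assuming a simple per-folder word-count table; the function names and the "Spam/Mortgage" folder are hypothetical and do not reflect how Ifile stores its data.

```python
from collections import Counter, defaultdict

word_counts = defaultdict(Counter)   # folder -> word -> count

def learn(folder, words):
    word_counts[folder].update(words)

def unlearn(folder, words):
    word_counts[folder].subtract(words)
    # Drop entries that fell to zero so the table stays tidy
    for w in set(words):
        if word_counts[folder][w] <= 0:
            del word_counts[folder][w]

def refile(words, old_folder, new_folder):
    """Move a misfiled message's statistics to the folder it belongs in."""
    unlearn(old_folder, words)
    learn(new_folder, words)

learn("Spam/Snakeoil", ["lose", "weight", "fast"])
message = ["refinance", "mortgage", "fast"]
learn("Spam/Snakeoil", message)                    # the filter misfiled it here
refile(message, "Spam/Snakeoil", "Spam/Mortgage")  # the user corrects it
```

After the correction, later messages with similar vocabulary have a folder of their own to match against.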
--
(reverse (concatenate 'string "ac.notelrac.teneerf@" "454aa"))
http://www.ntlug.org/~cbbrowne/ifilter.html
Rules of the Evil Overlord #60. "My five-year-old child advisor will
also be asked to decipher any code I am thinking of using. If he
breaks the code in under 30 seconds, it will not be used. Note: this
also applies to passwords." <http://www.eviloverlord.com/>

Christopher Browne

Aug 17, 2002, 9:32:11 AM

In the last exciting episode, afs9...@yahoo.com (AFS97209) wrote::

> How effective is it in filtering out requsts from African govenments
> to launder money?

Very much so. Those messages head to Spam/Pyramid and nowhere else.

The contents of the messages involve a vocabulary that is
quite repetitive between messages, so it's an _ideal_ candidate for
Naive Bayesian filtering.
--
(reverse (concatenate 'string "moc.enworbbc@" "sirhc"))
http://cbbrowne.com/info/spiritual.html
"There are two ways of constructing a software design: One way is to
make it so simple that there are obviously no deficiencies, and the
other way is to make it so complicated that there are no obvious
deficiencies. The first method is far more difficult."
-- C.A.R. Hoare

Christopher Browne

Aug 17, 2002, 10:31:42 AM

A long time ago, in a galaxy far, far away, "Herb Martin" <He...@LearnQuick.Com> wrote:
>> And if you want to work with an existing package that has been mature
>> for several years now, you might look at the URL below for "Ifile." I
>> helped tune it to become pretty fast.
>
> IFile's documentation and download page is
> included at the end of Graham's article.
>
> http://www.ai.mit.edu/~jrennie/ifile/
>
>> And I have to disagree somewhat with Graham's article; Naive Bayesian
>> filtering _doesn't_ provide _quite_ as good results as he implies.
>> Having both "sex" and "sexy" in a message does _not_ guarantee at P >
>> 0.99 that messages will get tossed into the "spam" category.
>
> I am not certain of your 'naive' filtering usage with the example of
> only "included" words. IFile's doc page describes it's algorythm as
> "naive bayesian filtering" as well.
>
> Graham is using the words included in "good mail" to counter this,
> as IFile seems to do.

The point is that _all_ the words in the message are considered.

For instance, if I throw my message, which conspicuously contains both
the word "sex" and the word "sexy," purportedly surefire indications
of spam, at ifile, the fact that it mentions Ifile several times means
that it heads to the "Apps/Ifile" folder, where my archives of
the last five years of Ifile discussions reside.

To consider _only_ the words "sex" and "sexy" is a severe
oversimplification.
--
(reverse (concatenate 'string "gro.gultn@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/lisp.html
Objects & Markets
"Object-oriented programming is about the modular separation of what
from how. Market-oriented, or agoric, programming additionally allows
the modular separation of why."
-- Mark Miller

Herb Martin

Aug 17, 2002, 11:03:48 AM

> The point is that _all_ the words in the message are considered.
>
> For instance, if I throw my message, which conspicuously contains both
> the word "sex" and the word "sexy," purportedly surefire indications
> of spam, at ifile, the fact that it mentions Ifile several times means
> that it heads to the "Apps/Ifile" folder where resides my archives of
> the last five years of Ifile discussions.
>
> To consider _only_ the words "sex" and "sexy" is a severe
> oversimplification.

Well that makes more sense.

What about Graham's method leads one to believe that IFile
would not be considered? Several of the examples he gives
(using 'Lisp' for himself instead of 'Ifile' as you would) are
isomorphic to this issue -- he is including words from the "good
mail" as well.


--


Herb Martin
Try ADDS for great Weather too:
http://adds.aviationweather.noaa.gov/projects/adds

"Christopher Browne" <cbbr...@acm.org> wrote in message
news:ajlmod$1bspp4$1...@ID-125932.news.dfncis.de...


> A long time ago, in a galaxy far, far away, "Herb Martin"
<He...@LearnQuick.Com> wrote:
> >> And if you want to work with an existing package that has been mature
> >> for several years now, you might look at the URL below for "Ifile." I
> >> helped tune it to become pretty fast.
> >
> > IFile's documentation and download page is
> > included at the end of Graham's article.
> >
> > http://www.ai.mit.edu/~jrennie/ifile/
> >
> >> And I have to disagree somewhat with Graham's article; Naive Bayesian
> >> filtering _doesn't_ provide _quite_ as good results as he implies.
> >> Having both "sex" and "sexy" in a message does _not_ guarantee at P >
> >> 0.99 that messages will get tossed into the "spam" category.
> >
> > I am not certain of your 'naive' filtering usage with the example of
> > only "included" words. IFile's doc page describes it's algorythm as
> > "naive bayesian filtering" as well.
> >
> > Graham is using the words included in "good mail" to counter this,
> > as IFile seems to do.
>

Joe Marshall

Aug 17, 2002, 4:34:29 PM

"sv0f" <no...@vanderbilt.edu> wrote in message news:none-16080...@129.59.212.53...

Perhaps this technique could be used to filter out the large
amount of crap postings on this newsgroup.

Erik Naggum

Aug 17, 2002, 4:42:29 PM

* xah <x...@xahlee.org>

| (fuck Erik Naggum. Fuck you. You got it?)

Got it. Now get on with your life. Thank you.

Kaz Kylheku

Aug 17, 2002, 6:20:54 PM

In article <B9838E4A.2CAF%x...@xahlee.org>, xah wrote:
> The phenomenon of spam is a human-social phenomenon. Spammers spam because
> it is effective.

That's only because you can't see spammers for the anti-social twits that they
are, who will keep spamming even when it's not effective. Or they will define
their acceptable effectiveness to be something ridiculously low, like one
positive response from ten million spams. Or even define negative responses as
good responses, so that ``don't send me this crap'' earns one a permanent spot
in their list.

Spamming is not effective in any sense of the word that an actual marketer
would comprehend.

Now, why *don't* you see spammers for the anti-social twits that they are? I
have my own idea about that.

Thien-Thi Nguyen

Aug 17, 2002, 6:51:35 PM

Kaz Kylheku <k...@ashi.footprints.net> writes:

> Spamming is not effective in any sense of the word that an actual marketer
> would comprehend.

well clearly you have been lucky enough not to spend too much time around
(professional) marketers, who take great pains in safe-guarding their power to
comprehend everything positively. their job is to foist this inability to
discern the feedback loop onto others (primarily professional sales people).
this is because when business is good, nobody cares, it's only when business
is bad that self-examination is painful. it's no surprise that professional
sales people also take it to be their job to point the finger back at the
marketers.

whoever thought up the sales / marketing (organizational) partitioning was
probably a consultant weeping in anticipation of the spoils to be reaped from
the turf wars imminent. split the mind and sell aspirin...

(actually, i have no clue what your background w/ these professions is; these
ramblings are from my own limited experience as a naive geek co-founding a
chip company where the only lisp involved was emacs lisp... for round two,
i'd like to work my way up through the tool chain w/ lisp but somehow i got
distracted lo these last four years.)

thi

ilias

Aug 18, 2002, 10:47:58 AM

xah wrote:
> There are two lispy big wigs, namely Paul Graham and Erik Naggum, who thinks
> their hotshot mouthing on spamming is something of value.
>
> Their outpouring, is not unlike that of damming of drivel flood.
>
> In the treatment of flood, there is the brute force of building of a dam.
> This is the no-nonsense solution of brutes the likes of technology geeks.
> Among them are the brute elite the likes of Graham'n'Naggum, who speak on
> fine engineering far ahead of fellow brutes.

...


> These brutes decline spam, but when they happen upon a spam that pleases
> them, such as a great porno site advertisement or something else that caught
> their personal interest, they will follow the advertisement.

i don't like spam.
when something is interesting (it happens, technically, tittically) i
try to push the delete-button before i read more,
sometimes i'm not able.

> For spammers,
> spam is effective if there is one response out of one thousand. Similarly,
> brutes will respond to one out of one thousand hateful spams. The prospect
> of spamming is thus kept alive by the populous brutes, everywhere, despite
> tumultuous noises they makes.
>
> As a human community or society, the treatment of spam is up to us, not
> fucking technology.

that is partly correct.

is up to us, assisted by (fucking or not) technology

> (fuck Erik Naggum. Fuck you. You got it?)

"fuck Erik Naggum".

"Fuck *you*" [relates to 'Erik Naggum', relates to 'the reader', relates
to 'Paul Graham', relates to technology-lovers?]

"you got it?"

no, please clarify.


> The gist is
> "what do we want?".

"we". who belongs to "we".

> (Note: not "what technology geeks want" or "what fucking
> unix morons want".)

are they not included in "we"?

> If we do not want spam, there are two ways to get rid of
> it: Thru law, and thru inaction.

- law
- inaction
- wisdom
- technology
- creativity
- cooperation
- understanding
- ...

solving the problem when working together.

> The former is a well-known proposal in the
> process. The latter, is possible only if human are not lazy unthinking
> beer-drinking brutes, which is never going to happen.

Take a human with a common intelligence, and place him in a group of
gorillas, he'll be a brilliant individual (relatively).

if he insists, in that group, on his brilliance, the gorillas will give
him a brilliant fuck.

The "lazy unthinking beer-drinking brutes"-group belongs to the
problem-domain, basically it is the most important and unchangeable part.

When ignoring this, you declare yourself a complete idiot.

but you're maybe simply jealous [someone wants something that another
has]. Because of your inability to "don't think - drink beer - and fuck -
be happy"

> PS i request that anyone who read so far and find it worthwhile to send me
> an email with the line "Xah, you are beautiful.". Thanks in advance.

sorry, no email.

i've placed it in the subject.


Michael Sullivan

Aug 19, 2002, 2:49:12 PM

xah <x...@xahlee.org> wrote:

> The phenomenon of spam is a human-social phenomenon. Spammers spam because
> it is effective. Consumers'S mouths says no but their actions says yes,
> because for the vast majority they are unthinking and happy-go-lucky brutes.

In fact, the problem with spam is not that large numbers of people
respond, but that it is so cheap to send (for the spammer anyway, since
the cost is distributed amongst the recipients and those who share their
systems and networks) that nearly *any* response is effective for them.

The problem with spam is that it is theft. If spammers actually had to
bear the costs of their spam, they would never send it, because the
response rates are ridiculously low. Since they do not, and it is cheap
and easy to send out hundreds of millions of messages, a response rate
of ten in a million is perfectly acceptable to them.

I think that Graham may be right, that if good spam filtering became
normal and automatic in nearly every email client (or server), that
response rates might eventually drop so low that it would become
worthless to spam.


Michael

--
Michael Sullivan
Business Card Express of CT Thermographers to the Trade
Cheshire, CT mic...@bcect.com

Michael Sullivan

Aug 19, 2002, 2:49:10 PM

Christopher Browne <cbbr...@acm.org> wrote:

> A long time ago, in a galaxy far, far away, "Herb Martin"
> <He...@LearnQuick.Com> wrote:
> >> And if you want to work with an existing package that has been mature
> >> for several years now, you might look at the URL below for "Ifile." I
> >> helped tune it to become pretty fast.
> >
> > IFile's documentation and download page is
> > included at the end of Graham's article.
> >
> > http://www.ai.mit.edu/~jrennie/ifile/
> >
> >> And I have to disagree somewhat with Graham's article; Naive Bayesian
> >> filtering _doesn't_ provide _quite_ as good results as he implies.
> >> Having both "sex" and "sexy" in a message does _not_ guarantee at P >
> >> 0.99 that messages will get tossed into the "spam" category.
> >
> > I am not certain of your 'naive' filtering usage with the example of
> > only "included" words. IFile's doc page describes it's algorythm as
> > "naive bayesian filtering" as well.
> >
> > Graham is using the words included in "good mail" to counter this,
> > as IFile seems to do.

> The point is that _all_ the words in the message are considered.

Graham's algorithm *does* consider all the words, sort of. It does a
hash lookup on every word, and then considers the fifteen words in that
mail that are the strongest signals (whether that be a signal of "good"
mail, or "bad") and does the bayes calculation on those. It seems to me
that it wouldn't be all that computationally intensive to extend the
bayes calculation to more words.
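
The combining step described above can be sketched in a few lines of Python: pick the n probabilities furthest from the neutral 0.5, then combine them with the product formula from Graham's article. The function name and the sample probabilities are made up for illustration, and n is a parameter so the effect of using more than fifteen words is easy to try.

```python
import math

def combined_probability(probs, n=15):
    """Combine the n per-word probabilities that lie furthest from 0.5."""
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam = math.prod(interesting)
    ham = math.prod(1.0 - p for p in interesting)
    return spam / (spam + ham)

# Hypothetical per-word spam probabilities for one message
probs = [0.99, 0.99, 0.01, 0.5, 0.6, 0.95]
top_three = combined_probability(probs, n=3)
all_words = combined_probability(probs)
```

Words near 0.5 contribute almost nothing to the result, which is why keeping only the strongest signals changes the outcome so little.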

I just did a very quick implementation of just the math and it looks
like speed is not the problem, but arbitrary precision. With thousands
of words, you easily reach past the edge of the IEEE floating point spec
for some of your intermediary values, leading to a (/ x 0) situation.
With a good arbitrary precision math library, this is not an issue, but
it also appears that using the most significant 100-500 words is likely
to produce a certain result so often that it ought to be plenty.

I fed my bayes calculation pseudo random numbers and found that it was
generating probabilities over 4 sigma one way or another more than 1/2
the time using 100 numbers. At 200 numbers, something like 80% were 5+
sigma, and a 100 run test did not produce a single probability between 5
and 95%.

So I'm guessing that using the most significant 200 numbers is unlikely
to produce results any different from doing the bayes calculation on
every last word.

The one scenario where I see trouble is a real message which for some
legitimate reason includes a forward of a spam example. If there's
enough stuff added to the real message, his over-weighting of "good"
indicators will probably tip the scale.

But if it's a fairly short forward message, followed by an actual spam
(especially with full headers), it would almost certainly be tagged as
"spam", even though, this might be somebody trading information trying
to track down a spammer. Or perhaps someone with too much time on their
hands read a spam and found it funny or otherwise interesting and
decided to pass it on to somebody.

I'm not sure how you can filter spam well without risking a false
positive in at least this case, but I suspect that this naive Bayesian
algorithm won't do the trick, unless there's a fair bit of "good"
content.

> For instance, if I throw my message, which conspicuously contains both
> the word "sex" and the word "sexy," purportedly surefire indications
> of spam, at ifile, the fact that it mentions Ifile several times means
> that it heads to the "Apps/Ifile" folder where resides my archives of
> the last five years of Ifile discussions.

> To consider _only_ the words "sex" and "sexy" is a severe
> oversimplification.

Except that he doesn't actually do this.

Rahul Jain

Aug 19, 2002, 3:28:10 PM

mic...@bcect.com (Michael Sullivan) writes:

> But if it's a fairly short forward message, followed by an actual spam
> (especially with full headers), it would almost certainly be tagged as
> "spam", even though, this might be somebody trading information trying
> to track down a spammer. Or perhaps someone with too much time on their
> hands read a spam and found it funny or otherwise interesting and
> decided to pass it on to somebody.
>
> I'm not sure how you can filter spam well without risking a false
> positive in at least this case, but I suspect that this naive Bayesian
> algorithm won't do the trick, unless there's a fair bit of "good"
> content.

You can have the filter disabled for people you know won't send you
worthless messages.
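
A sketch of that idea, assuming the filter is a function that can simply be skipped for trusted senders. The addresses, the folder names and the stand-in classifier are invented for illustration.

```python
def deliver(sender, words, trusted_senders, classify):
    """Return a folder name; trusted senders bypass the classifier entirely."""
    if sender.strip().lower() in trusted_senders:
        return "inbox"
    return classify(words)

def toy_classify(words):
    # Stand-in for a real statistical filter (purely illustrative)
    return "spam" if "viagra" in words else "inbox"

trusted = {"friend@example.com"}
forwarded_spam = ["fwd", "look", "at", "this", "viagra", "ad"]
from_friend = deliver("Friend@Example.com", forwarded_spam, trusted, toy_classify)
from_stranger = deliver("stranger@example.net", forwarded_spam, trusted, toy_classify)
```

The same forwarded spam reaches the inbox when a trusted correspondent sends it and is filtered otherwise. Sender addresses are easy to forge, though, so a real setup would need more than this check.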

--
-> -/ - Rahul Jain - \- <-
-> -\ http://linux.rice.edu/~rahul -=- mailto:rj...@techie.com /- <-
-> -X "Structure is nothing if it is all you got. Skeletons spook X- <-
-> -/ people if [they] try to walk around on their own. I really \- <-
-> -\ wonder why XML does not." -- Erik Naggum, comp.lang.lisp /- <-
|--|--------|--------------|----|-------------|------|---------|-----|-|
(c)1996-2002, All rights reserved. Disclaimer available upon request.

Robert St. Amant

Aug 19, 2002, 3:36:51 PM

Kaz Kylheku <k...@ashi.footprints.net> writes:

> In article <B9838E4A.2CAF%x...@xahlee.org>, xah wrote:
> > The phenomenon of spam is a human-social phenomenon. Spammers spam because
> > it is effective.
>
> That's only because you can't see spammers for the anti-social twits that they
> are, who will keep spamming even when it's not effective. Or they will define
> their acceptable effectiveness to be something ridiculously low, like one
> positive response from ten million spams. Or even define negative responses as
> good responses, so that ``don't send me this crap'' earns one a permanent spot
> in their list.
>
> Spamming is not effective in any sense of the word that an actual marketer
> would comprehend.

From an article in this week's Newsweek, titled "Spamming the World"
(http://www.msnbc.com/news/792491.asp):

One bulk e-mailer says that when she started spamming in 1999,
she could send out 100,000 e-mails and get 25 responses. Today,
she has to send out a million messages to get the same response
(a 0.0025 percent hit rate).

It's interesting reading. I don't think spammers will ever stop (like
telemarketers), as long as they're getting *any* responses. Short of
lawsuits, that is.
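
As a quick check of the figures quoted from the article, here is a minimal sketch of the arithmetic. The function name is made up for illustration; only the counts (25 responses per 100,000 and per 1,000,000 messages) come from the quote.

```python
def hit_rate_percent(responses, messages_sent):
    """Return the response rate as a percentage of messages sent."""
    return 100.0 * responses / messages_sent

# Figures from the quoted passage.
rate_1999 = hit_rate_percent(25, 100_000)     # 25 responses per 100,000 e-mails
rate_2002 = hit_rate_percent(25, 1_000_000)   # the same 25 responses per million

print(f"1999: {rate_1999:.4f}%  2002: {rate_2002:.4f}%")
```

So the quoted 0.0025 percent is the current rate, a tenfold drop from 0.025 percent in 1999.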

--
Rob St. Amant
http://www4.ncsu.edu/~stamant

Erik Naggum

unread,
Aug 19, 2002, 4:59:30 PM8/19/02
to
* Robert St. Amant

| It's interesting reading. I don't think spammers will ever stop (like
| telemarketers), as long as they're getting *any* responses. Short of
| lawsuits, that is.

I am actually amazed that out of the million people needed to get 25
responses, there has not yet been a single potential psychopathic axe
murderer living in the spammer's city. Imagine just /one/ such case.

Joe Marshall

unread,
Aug 19, 2002, 7:26:12 PM8/19/02
to

"Michael Sullivan" <mic...@bcect.com> wrote in message news:1fh63id.15bxu5pizyf4bN%mic...@bcect.com...

>
> In fact, the problem with spam is not that large numbers of people
> respond, but that it is so cheap to send (for the spammer anyway, since
> the cost is distributed amongst the recipients and those who share their
> systems and networks) that nearly *any* response is effective for them.

Nearly any *valid* response is effective. One part of the reason that
spam works is that it is possible to `identify' the 25 people out of
the million that act upon the message. When you spam a million email
addresses most of the recipients discard or ignore the message.
The set of people that respond to the spam is *much* richer
in suckers than the original set of people identified by their
addresses.

If *every* spam yielded a (possibly bogus) response, then the
value of spamming would be severely decreased. Spamming a set
of email addresses would yield no information about which recipients
are suckers because they *all* seem to be. Putting a URL in the
spam would be useless because it would simply cause a million
automatic `hits' on the page.

> The problem with spam is that it is theft. If spammers actually had to
> bear the costs of their spam, they would never send it, because the
> response rates are ridiculously low. Since they do not, and it is cheap
> and easy to send out hundreds of millions of messages, a response rate
> of ten in a million is perfectly acceptable to them.

But a response rate of a million in a million would *not* be
acceptable.
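
The argument can be made concrete with a rough sketch. It assumes, purely for illustration, that some fraction of the uninterested recipients send an automatic bogus reply; the function name and the `auto_reply_rate` parameter are hypothetical.

```python
def genuine_fraction(genuine, recipients, auto_reply_rate):
    """Fraction of all responses that come from people who really acted.

    auto_reply_rate is the (assumed) share of the other recipients
    whose software sends a bogus response on their behalf.
    """
    bogus = (recipients - genuine) * auto_reply_rate
    return genuine / (genuine + bogus)

# 25 genuine responders out of a million, as in the figures above.
for rate in (0.0, 0.01, 1.0):
    share = genuine_fraction(25, 1_000_000, rate)
    print(f"auto-reply rate {rate:>4}: {share:.6f} of responses are genuine")
```

With no bogus replies every response identifies a sucker; if everyone replies, a response carries no more information than the original address list did.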

Frode Vatvedt Fjeld

unread,
Aug 20, 2002, 6:03:11 AM8/20/02
to
Erik Naggum <er...@naggum.no> writes:

> I am actually amazed that out of the million people needed to get 25
> responses, there has not yet been a single potential psychopathic
> axe murderer living in the spammer's city. Imagine just /one/ such
> case.

And when the judge asks the axeman if he's got anything to say in his
defense, he'd say he "just wanted to help the economy and add to the
GNP. Please realize this."

--
Frode Vatvedt Fjeld

John Carroll

unread,
Aug 20, 2002, 6:32:23 AM8/20/02
to
In article <o8f89.2487$aA.632@sccrnsc02>,
"Joe Marshall" <prunes...@attbi.com> wrote:

> "Michael Sullivan" <mic...@bcect.com> wrote in message
> news:1fh63id.15bxu5pizyf4bN%mic...@bcect.com...
> >
> > In fact, the problem with spam is not that large numbers of people
> > respond, but that it is so cheap to send (for the spammer anyway, since
> > the cost is distributed amongst the recipients and those who share their
> > systems and networks) that nearly *any* response is effective for them.
>
> Nearly any *valid* response is effective. One part of the reason that
> spam works is that it is possible to `identify' the 25 people out of
> the million that act upon the message. When you spam a million email
> addresses most of the recipients discard or ignore the message.
> The set of people that respond to the spam is *much* richer
> in suckers than the original set of people identified by their
> addresses.
>
> If *every* spam yielded a (possibly bogus) response, then the
> value of spamming would be severely decreased. Spamming a set
> of email addresses would yield no information about which recipients
> are suckers because they *all* seem to be. Putting a URL in the
> spam would be useless because it would simply cause a million
> automatic `hits' on the page.

So the most effective spam processing system would extract URLs /
email / fax / telephone contacts from the spam and automatically
respond with a plausible sounding covering message (of course making
sure not to identify the sender in any way).

Then the replies from the 25 suckers would be submerged. Of course
the spammers might then resort to software that tried to
automatically detect just the real messages from the suckers and
discard the rest.
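
The extraction step might look something like the sketch below. It only pulls URLs and e-mail addresses out of a message body; fax and telephone numbers, and any actual responding, are left out. The regular expressions are deliberately simple assumptions, not a complete grammar for either format.

```python
import re

# Simplified, hypothetical patterns; real URLs and addresses are messier.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_contacts(text):
    """Return the URLs and e-mail addresses found in a message body."""
    return {
        "urls": URL_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }

sample = "Visit http://example.com/offer now or write to sales@example.com today."
contacts = extract_contacts(sample)
print(contacts)
```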

John

Michael Hudson

unread,
Aug 20, 2002, 9:04:40 AM8/20/02
to
Rahul Jain <ra...@rice.edu> writes:

> You can have the filter disabled for people you know won't send you
> worthless messages.

Until the next klez.

Cheers,
M.

--
My hat is lined with tinfoil for protection in the unlikely event
that the droid gets his PowerPoint presentation working.
-- Alan W. Frame, alt.sysadmin.recovery

Christopher Browne

unread,
Aug 20, 2002, 10:29:07 AM8/20/02
to
In the last exciting episode, Rahul Jain <ra...@rice.edu> wrote:

> You can have the filter disabled for people you know won't send you
> worthless messages.

That doesn't work. I get a lot of messages that claim to be from
"cbbr...@acm.org", which is the identity of someone I usually
_presume_ that I'd be prepared to trust fairly well.
--
(concatenate 'string "cbbrowne" "@acm.org")
http://www3.sympatico.ca/cbbrowne/sap.html
Rules of the Evil Overlord #215. "If I ever MUST put a digital timer
on my doomsday device, I will buy one free from quantum mechanical
anomalies. So many brands on the market keep perfectly good time while
you're looking at them, but whenever you turn away for a couple
minutes then turn back, you find that the countdown has progressed by
only a few seconds." <http://www.eviloverlord.com/>

Rahul Jain

unread,
Aug 20, 2002, 2:01:39 PM8/20/02
to
Christopher Browne <cbbr...@acm.org> writes:

> In the last exciting episode, Rahul Jain <ra...@rice.edu> wrote:
> > You can have the filter disabled for people you know won't send you
> > worthless messages.
>
> That doesn't work. I get a lot of messages that claim to be from
> "cbbr...@acm.org", which is the identity of someone I usually
> _presume_ that I'd be prepared to trust fairly well.

I forgot to mention that you trace the Received headers and only
accept what trusted servers say (and assume that the servers on the
path between you and a friend are trusted).
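
One way to read this is sketched below, using Python's standard `email` package. The idea is to walk the Received headers from the newest one downwards and stop believing the chain at the first header written by a server that is not on the trusted list. The host names, the regular expressions, and the function name are all assumptions for illustration; real Received headers vary a great deal between mail servers.

```python
import re
from email import message_from_string

# Hypothetical patterns for "from <host> ... by <host>" in a Received header.
BY_RE = re.compile(r"\bby\s+([A-Za-z0-9.-]+)", re.IGNORECASE)
FROM_RE = re.compile(r"from\s+([A-Za-z0-9.-]+)", re.IGNORECASE)

def last_trusted_handoff(raw_message, trusted_hosts):
    """Return the host that the last trusted server says it got the mail from.

    Returns None if not even the newest Received header was written by a
    trusted server, or if that header names no sending host.
    """
    msg = message_from_string(raw_message)
    handoff = None
    # Each server prepends its own header, so the first one is the newest.
    for header in msg.get_all("Received", []):
        by = BY_RE.search(header)
        if by is None or by.group(1).lower() not in trusted_hosts:
            break  # written by a server we do not trust; ignore the rest
        sender = FROM_RE.match(header.strip())
        handoff = sender.group(1).lower() if sender else None
    return handoff

raw = (
    "Received: from mail.friend.example by mx.me.example; Tue, 20 Aug 2002 14:00:00 -0500\n"
    "Received: from forged.example by mail.friend.example; Tue, 20 Aug 2002 13:59:00 -0500\n"
    "Received: from nowhere.example by forged.example; Tue, 20 Aug 2002 13:58:00 -0500\n"
    "From: friend@friend.example\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
trusted = {"mx.me.example", "mail.friend.example"}
origin = last_trusted_handoff(raw, trusted)
print(origin)
```

A whitelist entry for the friend would then be honoured only if `origin` is a host the friend normally posts from, whatever the From line claims.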

synthespian

unread,
Aug 21, 2002, 12:12:14 AM8/21/02
to
On Mon, 19 Aug 2002 17:59:30 -0300, Erik Naggum wrote:

>
> I am actually amazed that out of the million people needed to get 25
> responses, there has not yet been a single potential psychopathic axe
> murderer living in the spammer's city. Imagine just /one/ such case.
>

Hahahaha... :-) But that's because psychopaths are 1% of the
population (read Robert Hare). We need more spam! :-))

Cheers,
Henry
--
________________________________________________________________
Micro$oft-Free Human 100% Debian GNU/Linux
KMFMS "Bring the genome to the people!

Raffael Cavallaro

unread,
Aug 21, 2002, 12:52:50 AM8/21/02
to
"Joe Marshall" <prunes...@attbi.com> wrote in message news:<pry79.59612$983.72590@rwcrnsc53>...

> Perhaps this technique could be used to filter out the large
> amount of crap postings on this newsgroup.

Perhaps you were being facetious, but the Bayesian approach described
by Paul Graham could certainly be used to filter out whatever you
consider to be undesirable content from *any* set of text based
messages. And it "learns," adapting to novel garbage as it arises.
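
A simplified sketch of the "learning" part follows. It is loosely modeled on the per-word scoring in Graham's article as best I understand it: the doubling of the good counts, the minimum of five occurrences, and the 0.01/0.99 clamps are my reading of that article and should be checked against it. The whitespace tokenizer and all names here are illustrative assumptions.

```python
from collections import Counter

def train(counter, message):
    """Add the words of one hand-classified message to a corpus counter."""
    counter.update(message.lower().split())

def token_spam_probability(token, good, bad, ngood, nbad):
    """Estimate how strongly a token indicates unwanted mail, or None."""
    g = 2 * good[token]   # good counts doubled to bias against false positives
    b = bad[token]
    if g + b < 5:
        return None       # too little evidence to say anything
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))

good, bad = Counter(), Counter()
good_mail = ["lisp macro question"] * 5
junk_mail = ["buy cheap pills now"] * 5
for m in good_mail:
    train(good, m)
for m in junk_mail:
    train(bad, m)

p_pills = token_spam_probability("pills", good, bad, len(good_mail), len(junk_mail))
p_lisp = token_spam_probability("lisp", good, bad, len(good_mail), len(junk_mail))
p_new = token_spam_probability("unseen", good, bad, len(good_mail), len(junk_mail))
print(p_pills, p_lisp, p_new)
```

Adapting to novel garbage is then just a matter of calling `train` on each newly classified message and recomputing the scores.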

Raf

Xah Lee

unread,
Aug 21, 2002, 1:55:28 AM8/21/02
to
A series of replies in this thread reminded me of Spy vs Spy.

Recall that Spy vs Spy is a popular comic by Antonio Prohias that
appears in Mad magazine.

Here are a few snapshots:
http://images.amazon.com/images/P/0823050211.01.LZZZZZZZ.jpg
http://www.collectmad.com/britishcovers/pro_spy1.jpg
http://www.collectmad.com/collectibles/bbsvsc.jpg

The theme is two archenemy spies, one dressed in white and one in black,
who try to best each other with schemes and technologies. One creates a
voice-recognition missile, then the other invents a voice-exchanging
device. The final frame of the comic has the second spy
shrieking with mirth in a victory pose over the mishaps of the other.
Turn to the next installment and the winner and loser are reversed: we
see one spy excitedly planning a booby trap. When he enters the other
spy's house to install the bomb, he gets blown up because the other spy
has spied on his scheme. Again the hilariously smug victory pose over
the misfortune of the other.

Their fight is endless. Over and over we read with glee about the silly
stratagems and incredible technologies they devise, only to have them
backfire on themselves.

As i sit here and read the technology geeking morons fighting with
spammers.


Bibliography:

Some snapshots
http://www.leedberg.com/mad/spies/spies.html

A Mad cover featuring Spy vs Spy
http://www.collectmad.com/britishcovers/5thmad.htm
http://www.collectmad.com/collectibles/bbsvsc.htm

Spy Vs. Spy: The Complete Casebook by Antonio Prohias
http://www.amazon.com/exec/obidos/ASIN/0823050211/xahhome-20

A dress-up of Spy vs Spy
http://members.aol.com/nebula5/spyvspy.html

Xah
x...@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

ilias

unread,
Aug 21, 2002, 5:50:01 AM8/21/02
to
Xah Lee wrote:
> A series of replies in this thread reminded me of Spy vs Spy.
>
> Recall that Spy vs Spy is a popular comic by Antonio Prohias that
> appears in Mad magazine.
One of my favorites.

> Their fight is endless. Over and over we read with glee about the silly
> stratagems and incredible technologies they devise, only to have them
> backfire on themselves.
>
> As i sit here and read the technology geeking morons fighting with
> spammers.

Spy vs. Spy categories:

- "Spy vs. Spy"-fighters.

- "Spy vs. Spy"-talkers.

wake up.

Joe Marshall

unread,
Aug 21, 2002, 10:49:16 AM8/21/02
to

"Xah Lee" <x...@xahlee.org> wrote in message news:7fe97cc4.02082...@posting.google.com...

> A series of replies in this thread reminded me of Spy vs Spy.
>
> As i sit here and read the technology geeking morons fighting with
> spammers.

Touché.

Håkon Alstadheim

unread,
Aug 22, 2002, 5:33:39 AM8/22/02
to
Michael Hudson <m...@python.net> writes:

> Rahul Jain <ra...@rice.edu> writes:
>
>> You can have the filter disabled for people you know won't send you
>> worthless messages.
>
> Until the next klez.
>

All this should in _theory_ be taken care of automatically by Graham's
filter: Known senders will be added to the "good" set of words (with
their usual posting hosts etc.) with P=1, and after a short while a
virus will be added to the "bad" set, with P=1. At the same time the
known senders will have their goodness reduced.
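
To illustrate how a trusted-sender token and a virus token would interact, here is a minimal sketch of the naive Bayes combination of per-token probabilities. It assumes token scores are clamped to the range 0.01 to 0.99 rather than reaching exactly 0 or 1, which is my understanding of Graham's scheme; the function name and the example scores are hypothetical.

```python
from math import prod

def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, treating tokens as independent."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)

# One strongly "good" token (a known sender) against one strongly "bad" one.
balanced = combined_spam_probability([0.01, 0.99])

# The same known sender, but the message carries two strong virus markers.
infected = combined_spam_probability([0.01, 0.99, 0.99])

print(f"{balanced:.2f} {infected:.2f}")
```

A single known-sender token cancels a single bad token, but it is outvoted as soon as the message carries more bad evidence than good.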
--
Håkon Alstadheim, hjemmepappa.

Christopher Browne

unread,
Aug 22, 2002, 7:49:23 AM8/22/02
to
A long time ago, in a galaxy far, far away, Rahul Jain <ra...@rice.edu> wrote:
> mic...@bcect.com (Michael Sullivan) writes:
>
>> But if it's a fairly short forward message, followed by an actual spam
>> (especially with full headers), it would almost certainly be tagged as
>> "spam", even though this might be somebody trading information trying
>> to track down a spammer. Or perhaps someone with too much time on their
>> hands read a spam and found it funny or otherwise interesting and
>> decided to pass it on to somebody.
>>
>> I'm not sure how you can filter spam well without risking a false
>> positive in at least this case, but I suspect that this naive Bayesian
>> algorithm won't do the trick, unless there's a fair bit of "good"
>> content.
>
> You can have the filter disabled for people you know won't send you
> worthless messages.

That's one method.

Another is that the headers of a 'real' message will look rather more
like those appropriate for 'real' message folders.

This sort of thing is the pathological sort of situation for such
discrimination systems, and if it _does_ get confused at this, it
shouldn't come as any great surprise, because _any_ kind of censorship
system will have problems with this sort of thing.

For instance, if "pictures of naked people" are considered to be
obscenity, and thereby "illegal" (in some manner), what happens when
you have:

a) An anatomy guide, which _intentionally_ has pictures of naked
people?

b) An issue of the "Journal of Surgery" which not only has pictures
of "naked people," but further, in the special issue on "Treating
Victims of Sexual Crimes," has Really, Really Nasty Stuff?

c) A special issue of "Abnormal Psychology Today" specifically on the
effects of viewing pornography?

d) An explicit documentary about the ill effects of pornography on
the status of women? (This film exists as the Canadian NFB's _Not
a Love Story_, where viewings apparently are often plagued by
visits by the police carrying notice of obscenity charges.)

A researcher or doctor, doing their work, occasionally needs this sort
of material.

This sort of material is fairly likely, in other sorts of hands, to be
treated in a prurient manner.

If you were to want to send me an archive of "spam," it would be
advisable for you to bundle it carefully, putting it into a
compressed, possibly even encrypted, archive, so that it _doesn't_
look like spam whilst in transit.
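
A minimal sketch of the bundling step, using only Python's standard library: the messages are serialized, compressed, and base64-encoded so that the original words no longer appear in the transmitted text. Encryption is not shown, and the function names and the JSON container are assumptions chosen for illustration.

```python
import base64
import gzip
import json

def bundle(raw_messages):
    """Pack a list of message strings into one opaque ASCII blob."""
    payload = json.dumps(raw_messages).encode("utf-8")
    return base64.b64encode(gzip.compress(payload)).decode("ascii")

def unbundle(blob):
    """Recover the original list of message strings from a blob."""
    payload = gzip.decompress(base64.b64decode(blob))
    return json.loads(payload.decode("utf-8"))

corpus = ["Subject: MAKE MONEY FAST\n\nbody one", "Subject: hot offer\n\nbody two"]
packed = bundle(corpus)
restored = unbundle(packed)
print(len(packed), restored == corpus)
```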

Ditto if you and I were trying to share a copy of a "virulent"
computer virus. Don't send it as is, so that I might conceivably be
infected by it: when the CDC transfers potentially dangerous
substances from one lab to another, they need to bundle the substances
up _carefully_.

I've got a fairly nice "corpus" of spam; if someone wanted a copy, you
can be _sure_ that I'd put warning labels on it mentioning (amongst
other things):

-> This archive contains illegal business proposals;

-> This archive contains sexually oriented material, some of which is
_highly_ offensive, some of which is likely to be considered illegal
obscenity in your jurisdiction;

-> This may contain attempts at computer security exploits, which
might be damaging to your computer system.

"If you do not expressly promise to be careful and discreet in what
you do with what I'm sending you, I'm certainly not giving it to you.
I don't want trouble to result from sending it to you."

If you think I'm not serious, think again. I would think _hard_ about
the legal implications before I would _consider_ passing on a copy of
my "spam corpus," and the possibility of unexpected legal action most
_certainly_ would be in my mind. That Russian fellow spent much of
last year in jail just because of an _accusation_ of breaking the
DMCA. The makers of _Not A Love Story_ saw the chilling effect of
people getting arrested for showing their movie.
--
(concatenate 'string "aa454" "@freenet.carleton.ca")
http://www.ntlug.org/~cbbrowne/ifilter.html
"Over a hundred years ago, the German poet Heine
warned the French not to underestimate the power of ideas:
philosophical concepts nurtured in the stillness of a
professor's study could destroy a civilization."
--Isaiah Berlin in /The Power of Ideas/

Herb Martin

unread,
Aug 22, 2002, 12:31:50 PM8/22/02
to
> This sort of thing is the pathological sort of situation for such
> discrimination systems, and if it _does_ get confused at this, it
> shouldn't come as any great surprise, because _any_ kind of censorship
> system will have problems with this sort of thing.
>
> For instance, if "pictures of naked people" are considered to be
> obscenity, and thereby "illegal" (in some manner), what happens when
> you have:
>
> a) An anatomy guide, which _intentionally_ has pictures of naked
> people?
>
> b) An issue of the "Journal of Surgery" which not only has pictures
> of "naked people," but further, in the special issue on "Treating
> Victims of Sexual Crimes," has Really, Really Nasty Stuff?
>
> c) A special issue of "Abnormal Psychology Today" specifically on the
> effects of viewing pornography?
>
> d) An explicit documentary about the ill effects of pornography on
> the status of women? (This film exists as the Canadian NFB's _Not
> a Love Story_, where viewings apparently are often plagued by
> visits by the police carrying notice of obscenity charges.)

You know, Graham discusses this sort of thing
in his article and indicates that his filter handles it very well.

--
Herb Martin, PP-SEL
(...and aerobatic student)

"Christopher Browne" <cbbr...@acm.org> wrote in message

news:ak2j42$1f74v0$2...@ID-125932.news.dfncis.de...
