
Newbie: floating point optimization


Christian Hofer

Jan 13, 2004, 3:31:55 PM
Hello,

I have just a small question (trying to make my small numeric
integration exercise a bit faster):

When I write:

(let ((s 0.0))
  (declare (double-float s))
  (...))

is 0.0 automatically initialized as a double-float then, or is this
declaration a wrong presumption about s, or what?

Should I write something else? How do I write a double-float number
explicitly in Lisp?

Chris

Joe Marshall

Jan 13, 2004, 4:35:19 PM
Christian Hofer <ch_...@gmx.de> writes:

0.0d0

Gareth McCaughan

Jan 13, 2004, 4:36:31 PM
Christian Hofer wrote:

> When I write:
>
> (let ((s 0.0))
>   (declare (double-float s))
>   (...))
>
> is 0.0 automatically initialized as a double-float then, or is this
> declaration a wrong presumption about s, or what?

I'm afraid it's a wrong presumption about s. You're
giving it a single-float value and claiming it's a
double-float.

> Should I write something else? How do I write a double-float number
> explicitly in Lisp?

0.0d0.

--
Gareth McCaughan
.sig under construc

Barry Margolin

Jan 13, 2004, 5:07:52 PM
In article <bu1kjq$sqp$1...@online.de>, Christian Hofer <ch_...@gmx.de>
wrote:

> Hello,
>
> I have just a small question (trying to make my small numeric
> integration exercise a bit faster):
>
> When I write:
>
> (let ((s 0.0))
>   (declare (double-float s))
>   (...))
>
> is 0.0 automatically initialized as a double-float then, or is this
> declaration a wrong presumption about s, or what?

A declaration is a promise about the value of the variable, and the
consequences are undefined if the promise isn't kept. So unless you
have *READ-DEFAULT-FLOAT-FORMAT* set to DOUBLE-FLOAT, the above code is
incorrect.

> Should I write something else? How do I write a double-float number
> explicitly in Lisp?

0.0d0
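
For instance (a sketch; the exact TYPE-OF output may vary slightly by
implementation, but the float subtypes are per the standard):

```lisp
;; An undecorated 0.0 normally reads as a single-float:
(type-of 0.0)    ; typically SINGLE-FLOAT

;; The d exponent marker makes the literal a double-float:
(type-of 0.0d0)  ; DOUBLE-FLOAT

;; With the reader default rebound, plain literals read as doubles:
(let ((*read-default-float-format* 'double-float))
  (type-of (read-from-string "0.0")))  ; DOUBLE-FLOAT
```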

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA

Adam Warner

Jan 14, 2004, 6:26:19 PM
Hi Barry Margolin,

> A declaration is a promise about the value of the variable, and the
> consequences are undefined if the promise isn't kept. So unless you
> have *READ-DEFAULT-FLOAT-FORMAT* set to DOUBLE-FLOAT, the above code is
> incorrect.
>
>> Should I write s.th. else? How do I write a double-float number in Lisp
>> explicitly?
>
> 0.0d0

...or 0d0, 1d0, 10d0 as shorthand notation for 0.0d0, 1.0d0 and 10.0d0
respectively.

Christian, just don't fall into the trap of thinking of the d as a
placeholder for the decimal point. Otherwise you might start writing
0d1, thinking you've entered 0.1d0 instead of 0.0d0.
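
Putting the thread's advice together, the opening snippet would become
(a sketch, with the body elided as in the original):

```lisp
(let ((s 0.0d0))              ; the initial value is now a double-float...
  (declare (double-float s))  ; ...so the declaration is honest
  (...))
```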

Regards,
Adam

Christian Hofer

Jan 18, 2004, 11:10:02 AM
Thank you for your answers!

Luckily LispWorks uses double-float as default anyway, so nothing went
wrong.

But when experimenting with optimization, I see some strange
results. Optimizing compilation for speed does not show any
effect. Using "the" sometimes increases the time for evaluation.

But on the other hand, the "time" function seems to be unreliable
anyway: it shows very different times for each evaluation. Is there a
way to make it stable? How do I know whether a shorter run time is
caused by an optimization and not by some random cause?

(I don't want to focus too much on optimization at the moment generally,
it's just that I don't want my homework to be ten times slower than that
of those who have used Java.)

Chris

Barry Margolin wrote:

> In article <bu1kjq$sqp$1...@online.de>, Christian Hofer <ch_...@gmx.de>

[...]

Håkon Alstadheim

Jan 19, 2004, 12:33:32 AM
Christian Hofer <ch_...@gmx.de> writes:

> But when experimenting with optimization, I see some strange
> results. Optimizing compilation for speed does not show any
> effect. Using "the" sometimes increases the time for evaluation.

Might be garbage collection.

Run the test many times: (time (dotimes (i 10000) (test))).


> (I don't want to focus too much on optimization at the moment
> generally, it's just that I don't want my homework to be ten times
> slower than that of those who have used Java.)

Make gc non-verbose, and defer it to after you've printed results
if you can.

For cmucl it would be something like:

(setf ext:*gc-verbose* nil)
;; maybe some (ext:gc :full t) strategically placed
(unwind-protect
    (progn
      (ext:gc-off)
      (your-stuff)
      (make-sure-to-give-some-output-for-user-to-look-at))
  (ext:gc-on))
--
Håkon Alstadheim, hjemmepappa.

Erik Naggum

Jan 19, 2004, 2:14:02 AM
* Christian Hofer

| Using "the" sometimes increases the time for evaluation.

It is permissible for an implementation to do run-time type-checking
to ensure that THE forms are honest. They could continue to do this
no matter what your speed optimize declarations are, but should turn
them off if you specify (declare (optimize (safety 0))).
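
As a sketch of how such declarations combine (the function and the
array type here are illustrative, not from the thread):

```lisp
(defun sum-of-squares (xs)
  ;; Illustrative only: with (safety 0) these declarations are
  ;; unchecked, so a type mismatch has undefined consequences.
  (declare (optimize (speed 3) (safety 0))
           (type (simple-array double-float (*)) xs))
  (let ((s 0.0d0))
    (declare (double-float s))
    (dotimes (i (length xs) s)
      (incf s (* (aref xs i) (aref xs i))))))
```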

| But on the other hand, using the "time"-function seems to be unreliable
| anyway: it shows very different times for each evaluation.

Modern computers do a lot of unannounced work, especially if they are
active on any sort of network. Updating system timers is usually done
in response to interrupts, and may take place with any latency that
keeps the clock accurate enough for comfort, so it is prudent for
the user of timers to ensure that the timing granularity is coarse
enough that values are trustworthy. This is why it is often a very
good idea to repeat the operation that is to be timed a /huge/ number
of times, or to use timer alarms that let you run at full speed for a
period of time only to return the number of repetitions it managed to
complete. Of course, the results obtained under such conditions are
nowhere near the execution speed you can expect when performing the
operation once or under wildly different conditions. For this reason,
profiling is an art best left to experts or those willing to become
experts, which will take a tremendous amount of time, investigation of
hardware and the actually executed machine code, memory arrangement,
etc. It is very easy to be attracted to the easily measured and to
relegate the unmeasurable to «mystic noise». The more precise values
you get from a measurement method, the more you have to brace yourself
to resist the sexiness of apparent simplicity and elegance. Computers
exhibit the placebo effect, too, and they will give you good results
along any scale you use, so you have to be able to predict the results
with inordinate precision and duly investigate any deviation from your
meticulous predictions. If you only measure «something» and are happy
with every positive development in the measured values, regardless of
cause, you will end up with good measurement values of something that
you would never have done. Performance tuning experts are very often
subjected to code that has been «improved» by people who have done just
about anything to shave off a millisecond here and a millisecond there
with absolutely no regard for the performance of anything else, least
of all the overall performance.

| (I don't want to focus too much on optimization at the moment
| generally, it's just that I don't want my homework to be ten times
| slower than that of those who have used Java.)

Well, you have at least figured out the optimal bait to entice Common
Lisp programmers to come to your aid, but will you get your academic
degree ten times faster if your homework is just as slow as the Java
solution of your competitors in the rat race? (No need to answer. :)

I suggest that you ignore performance completely and focus on two other
properties that performance obsession tends to ignore completely: That
it be /correct/, and that it not be /wasteful/. Waste indicates that
you lack understanding, incorrect indicates that you lack attention to
detail. High performance indicates a lucky match between you and the
execution vehicle. For instance, one contributor here recently posted
a function that inverted symbol names that was extremely wasteful, but
which expressed the core idea very well. A quality implementation of
this function would use both caching of the inverted result with the
symbol and an efficient state machine that determined that it should
not invert the string as soon as two characters with different case
were detected.

The Java crowd is actually extremely educational when it comes to this
whole question of optimization. Sun developed an environment that was
known to be slow as molasses, but then worked really hard at finding
ways to make it run faster, while the applications were extremely hard
to optimize for speed. These days, it takes even more effort to write
a better-performing solution in C or C++ than to write it in Java, and
it just isn't worth it, anymore. Of course, this means that instead
of being employable as a highly rewarded performance tweaker in C or
C++, you have to compete with a billion programmers in India and China
who rely on the thousand or so developers of the run-time environment.

Let me connect premature and unnecessary optimization with a known evil
that should at least work through the guilt-by-association mechanism:
The reason we see so much spam is not that it works, but that it does
not work. Those who engage in this crime believe that when they get a
low response rate, the best solution is to increase the volume of spam
so that they will get more responses. Locally, they optimize for more
responses, but globally, they reduce the likelihood of being heard at
all, increase the likelihood of /never/ getting a customer that might
have bought their goods or services if they had discovered them on
their own in a respectable advertising venue, and increase the cost of
marketing for all marketers. There is, however, not a shred of doubt
that those who engage in marketing through unwanted e-mail both rate
their marketing strategy a success and optimize the only way they can
measure. Had they been (a whopping lot) smarter, they would not have
cared just about the number of sales they made, but about the global
response rate to unwanted e-mail, which has dropped to less than one
response in 10 million messages and will drop to less than one in a
billion messages before the end of 2004 at current spam growth rates.
This happens because those who engage in this crime actually receive
responses from people who are so stupid they should be terminated on
the spot, but what research has been done on that pathetic demographic
has shown that they are the kind that needs to be fooled once before
they get it, so the market for virgin fools is rapidly diminishing.

The moral of this story is that if you optimize for the measurable
quantity and ignore the unmeasured and maybe unmeasurable quantities,
you end up annoying close to a billion people on the Internet.

--
Erik Naggum | Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.

Don Geddis

Jan 19, 2004, 1:52:43 PM
Erik Naggum <er...@naggum.no> writes:
> Those who engage in this crime [spam] believe that when they get a

> low response rate, the best solution is to increase the volume of spam
> so that they will get more responses. Locally, they optimize for more
> responses, but globally, they reduce the likelihood of being heard at
> all, increase the likelihood of /never/ getting a customer that might
> have bought their goods or services if they had discovered them on
> their own in a respectable advertising venue, and increase the cost of
> marketing for all marketers. There is, however, not a shred of doubt
> that those who engage in marketing through unwanted e-mail both rate
> their marketing strategy a success and optimize the only way they can
> measure. Had they been (a whopping lot) smarter, they would not have
> cared just about the number of sales they made, but about the global
> response rate to unwanted e-mail

But there is a "tragedy of the commons" problem here. Even if a few of the
spammers were sufficiently smart, it would take global cooperation for them
all to limit their volume. As long as there exist a few "cheaters", then even
the smart spammers really have no alternative other than to increase volume.

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ d...@geddis.org
I think there should be a thirty-dollar bill, because how many times have you
tried to buy something with a twenty-dollar bill but you didn't have enough
money?
-- Deep Thoughts, by Jack Handey [1999]

Coby Beck

Jan 19, 2004, 8:11:14 PM

"Don Geddis" <d...@geddis.org> wrote in message
news:87smibah...@sidious.geddis.org...

> Erik Naggum <er...@naggum.no> writes:
> > Those who engage in this crime [spam] believe that when they get a
> > low response rate, the best solution is to increase the volume of spam
> > so that they will get more responses. Locally, they optimize for more
> > responses, but globally, they reduce the likelihood of being heard at
> > all, increase the likelihood of /never/ getting a customer that might
> > have bought their goods or services if they had discovered them on
> > their own in a respectable advertising venue, and increase the cost of
> > marketing for all marketers. There is, however, not a shred of doubt
> > that those who engage in marketing through unwanted e-mail both rate
> > their marketing strategy a success and optimize the only way they can
> > measure. Had they been (a whopping lot) smarter, they would not have
> > cared just about the number of sales they made, but about the global
> > response rate to unwanted e-mail
>
> But there is a "tragedy of the commons" problem here. Even if a few of the
> spammers were sufficiently smart, it would take global cooperation for them
> all to limit their volume. As long as there exist a few "cheaters", then even
> the smart spammers really have no alternative other than to increase volume.

This misses the point. An SMS (Sufficiently Smart Spammer) knows increased
background noise /does not help/. Why would they add to it by having a
"keep up with the Joneses" attitude? True, they would continue to suffer from
non-SMS's, but they (hypothetically) know that 1 billion buckets of gasoline
is not better than 900 million when trying to put out a fire.

An SMS would presumably only target a higher-probability-of-response market,
damaged by non-SMS's or not...

--
Coby Beck
(remove #\Space "coby 101 @ big pond . com")


Brian Mastenbrook

Jan 19, 2004, 8:20:03 PM
In article <87smibah...@sidious.geddis.org>, Don Geddis
<d...@geddis.org> wrote:

> I think there should be a thirty-dollar bill, because how many times have you
> tried to buy something with a twenty-dollar bill but you didn't have enough
> money?
> -- Deep Thoughts, by Jack Handey [1999]

If there were a $30 bill, then everything would cost $29.95 pre-tax.

--
Brian Mastenbrook
http://www.cs.indiana.edu/~bmastenb/

Erik Naggum

Jan 19, 2004, 10:26:58 PM
* Don Geddis

| But there is a "tragedy of the commons" problem here.

More than this, the problem is that e-mail is like the commons.

| Even if a few of the spammers were sufficiently smart, it would take
| global cooperation for them all to limit their volume.

Nah, the sufficiently smart would simply not spam.

One of the most bizarrely unpleasant things that happened when I got
naggum.no online at the new ISP was all the spam that is evidently
being sent with a return address of <junk>@naggum.no. The default
setting for the postmaster account was a catch-all, so I got to see
hundreds of returned messages from mailer-daemons around the world
that rained down on the server in the first few hours of operation
alone. It was really very depressing. But what can one man do? I
had to turn off the catch-all function and live in ignorant bliss, at
least signalling to other postmasters that naggum.no was not the true
origin of the messages that got bounced.

Since I got this service online on 2004-01-08, and the word started to
spread around the DNS, the spamfilter has killed 446 messages just to
my mailbox and 190 messages with viruses attached. This is an address
that was down for 9 months, but is now in its 10th year, and it shows:
There are still spammers who send mail to message-IDs dating back to
1994, and the SGML mailboxes continue to receive lots and lots of junk
despite being discontinued in 1996. It is about one third of what my
1987-vintage University of Oslo address receives, but still, I'm quite
impressed with the persistence of these drooling idiots.

But speaking of mail, I got a tremendous amount of mail after I popped
my head in here barely a week ago, and between cleaning up the spam
and other stuff, it has taken me a week to respond to all of it. If
you sent me mail and have not received a response by now, it might
have been marked as spam and deleted. I would like to know if this
actually happened, so would you please try again? Thanks.

What worried me most about the spam was whether naggum.no had been
blacklisted (for other reasons than disliking my flames and stuff :),
but this appears not to have happened, but it's kind of hard to tell.
If anyone sent me mail and it failed to get delivered, please consider
re-sending it to en...@ifi.uio.no. Again, thanks.

I keep wondering what's wrong with the spammers, but sometimes I have
to ask myself what they believe they are doing /right/. The very
funny Harper's Index in the January 2004 issue lists two items that I
found both alarming and explanatory:

Rank of rhinoplasty and liposuction among the most common plastic
surgeries performed on men in the U.S.: 1, 2

Rank of rhinoplasty and penile enlargement among those most commonly
performed on men in the U.K.: 2, 1

So now we know which population with toy dicks are responsible for at
least one of the major ills of spamming, and this is probably another
tragedy of the commons.

Thomas F. Burdick

Jan 20, 2004, 2:08:38 AM
Erik Naggum <er...@naggum.no> writes:

> What worried me most about the spam was whether naggum.no had been
> blacklisted (for other reasons than disliking my flames and stuff :),
> but this appears not to have happened, but it's kind of hard to tell.
> If anyone sent me mail and it failed to get delivered, please consider
> re-sending it to en...@ifi.uio.no. Again, thanks.

I'd be surprised if you were blacklisted. While you were off the net,
there were a couple incidents of huge increases in spam, all with
faked origins. Also there were a couple mass-spamming viruses, which
aren't crippling the email infrastructure anymore (for now), but are
still sending a remarkable amount of crap. Your figure of 450 spams
at about 50% spam sounds right to me -- I'm getting about 300/day, about
50% of which contain viruses or had them stripped.

--
/|_ .-----------------------.
,' .\ / | No to Imperialist war |
,--' _,' | Wage class war! |
/ / `-----------------------'
( -. |
| ) |
(`-. '--.)
`. )----'

Espen Vestre

Jan 20, 2004, 4:02:43 AM
Erik Naggum <er...@naggum.no> writes:

> alone. It was really very depressing. But what can one man do? I
> had to turn off the catch-all function and live in ignorant bliss, at
> least signalling to other postmasters that naggum.no was not the true
> origin of the messages that got bounced.

Hmm. I haven't turned off my own catch-all account. Misuses of my
domain come in 'showers' every now and then, maybe you were too
quick to turn it off?
--
(espen)

Ingvar Mattsson

Jan 20, 2004, 6:28:38 AM
Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:

FWIW, I used to "not have a spam problem" (that is, the ratio of spam
to legitimate email was low enough that the spam was not a
distraction). In the last 3-4 months, it's gone to spam being the
*major* part of my mailbox. Addresses ending up there have been in
existence (and used on Usenet) from 1993, 1996 and somewhere around
1999 (and one only since last year).

In the last 96 hours, I've had 1221 emails delivered, of those 1096
have ended up in a mail box (that is, they do not exhibit extreme
symptoms of being email worms), probably mine.

//Ingvar
--
When the SysAdmin answers the phone politely, say "sorry", hang up and
run awaaaaay!
Informal advice to users at Karolinska Institutet, 1993-1994

Espen Vestre

Jan 20, 2004, 6:51:15 AM
Ingvar Mattsson <ing...@cathouse.bofh.se> writes:

> distraction). In the last 3-4 months, it's gone to spam being the
> *major* part of my mailbox. Addresses ending up there have been in

but of course. My private mail box usually gets less than 10
legitimate mails each day, but it gets at least 100 spams.

Just like Erik, I've also been hit by spammers misusing my domain,
and got an enormous load of bounces from AOL (AOL seems to relay all
incoming mail through "dumb" relay servers, so they represent a much
bigger problem for the owner of the abused domain than those mail
servers which are able to say "user unknown" directly to the spam
client). (I actually have a specific mail filtering rule for spam
bounces from AOL).

What I did notice, though, is that the misuse of my domain usually
comes in "showers" - I'll get a hundred bounces from AOL over a
day or two, and then it'll be silent for days or weeks.

Another interesting point: There was no single spam source when
I investigated the mails coming in during one of these "showers".
All the mail seems to have originated in single machines all over
the world *without* smtp servers, IMHO a clear indication that the
machinery used for delivery consists of a network of ordinary pcs
which run some trojan software that includes an smtp client but
not an smtp server (they probably use other notification methods
with the client polling for spam to deliver on e.g. irc channels).
--
(espen)

Tim Bradshaw

Jan 20, 2004, 7:01:41 AM
* Ingvar Mattsson wrote:

> FWIW, I used to "not have a spam problem" (that is, the ratio of spam
> to legitimate email was low enough that the spam was not a
> distraction). In the last 3-4 months, it's gone to spam being the
> *major* part of my mailbox. Addresses ending up there have been in
> existence (and used on Usenet) from 1993, 1996 and somewhere around
> 1999 (and one only since last year).

Me too. I've struggled with a Bayesian thing but it really doesn't
cope: I think it is failing because there isn't enough good mail to
train it on (I get hardly any `real' mail: probably 1% of my mail is
real), so it essentially classifies everything as spam, which I then
have to wade through. I may try feeding it my whole mailbox as good
to give it some more data to learn from, but that's only a couple of
days worth of spam now, so I'm not sure if it will work.

I'm curious as to how news remains so spam-free: I presume it's
because humans do the classifying, and it only takes a relatively tiny
number of people willing to sacrifice their lives to the cause to
clean up news spam for everyone.

I'm planning on using market forces as a deterrent: I can easily be
contacted by SMS, which I will look at, and which will cost real money
to send me (of course, in my case this is just being pretentious
because I'm hardly swamped by real mail, but I just can't face the
hour/day I have to spend going through spam: that's 10% of my working
day). Physical mail would be a good solution too, though you'd need a
PO box for that.

--tim


Ingvar Mattsson

Jan 20, 2004, 8:46:24 AM
Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:

[ SNIP ]


> Another interesting point: There was no single spam source when
> I investigated the mails coming in during one of these "showers".
> All the mail seems to have originated in single machines all over
> the world *witout* smtp servers, IMHO a clear indication that the
> machinery used for delivery consists of a network of ordinary pcs
> which run some trojan software that includes an smtp client but
> not an smtp server (they probably use other notification methods
> with the client polling for spam to deliver on e.g. irc channels).

They tend to include an SMTP server listening on a non-standard port,
according to the binary dissections I have seen described. They also
have a separate control channel to "port hop"[1], to make it harder to
scan for them; both the SMTP port and the control port hop. They then
speak normal SMTP to control where the spam is forwarded.

Nasty pieces of software. Quite a few of them propagate either by
exploiting remotely-visible services or by sending themselves on as
mail worms.

//Ingvar (can we stop talking about work, now?)
[1] Sort-of like frequency-hopping, to make direction-finding harder.
--
Self-referencing
Five, seven, five syllables
This haiku contains

Ingvar Mattsson

Jan 20, 2004, 8:51:31 AM
Tim Bradshaw <t...@cley.com> writes:

> * Ingvar Mattsson wrote:
>
> > FWIW, I used to "not have a spam problem" (that is, the ratio of spam
> > to legitimate email was low enough that the spam was not a
> > distraction). In the last 3-4 months, it's gone to spam being the
> > *major* part of my mailbox. Addresses ending up there have been in
> > existence (and used on Usenet) from 1993, 1996 and somewhere around
> > 1999 (and one only since last year).
>
> Me too. I've struggled with a Bayesian thing but it really doesn't
> cope: I think it is failing because there isn't enough good mail to
> train it on (I get hardly any `real' mail: probably 1% of my mail is
> real), so it essentially classifies everything as spam, which I then
> have to wade through. I may try feeding it my whole mailbox as good
> to give it some more data to learn from, but that's only a couple of
> days worth of spam now, so I'm not sure if it will work.

I initially fed my Bayesian filter with "mail saved away". I *do* tend
to save most mail I get (two mailing lists get mostly-read, then
mostly-deleted), so I did have a couple of years' worth to feed it
with.

> I'm curious as to how news remains so spam-free: I presume it's
> because humans do the classifying, and it only takes a relatively tiny
> number of people willing to sacrifice their lives to the cause to
> clean up news spam for everyone.

Too small a user-base (relatively speaking). It used to be more common
in the mid-to-late 90s (including special-purpose software tagging
itself in the header and with "download test version or buy registered
version here" URLs).

> I'm planning on using market forces as a deterrent: I can easily be
> contacted by SMS, which I will look at, and which will cost real money
> to send me (of course, in my case this is just being pretentious
> because I'm hardly swamped by real mail, but I just can't face the
> hour/day I have to spend going through spam: that's 10% of my working
> day). Physical mail would be a good solution too, though you'd need a
> PO box for that.

I'd rather give up physical mail, if I only could. Of course, that
looks less attractive now that my email inbox is getting as
annoyingly-full of spam as my physical mailbox gets by UCPC[1].

Saying that, I do get some 1000 mails a day in my work mail box that
need to be checked, so I *do* have good mailbox pattern recognition
skills.

//Ingvar
[1] Unsolicited Commercial Physical Correspondence
--
(defun m (f)
  (let ((db (make-hash-table :test #'equal)))
    #'(lambda (&rest a)
        (or (gethash a db) (setf (gethash a db) (apply f a))))))

Espen Vestre

Jan 20, 2004, 8:59:50 AM
Ingvar Mattsson <ing...@cathouse.bofh.se> writes:

> they tend to include an SMTP server listening on a non-standard port,
> according to the binary dissections I have seen described (they also
> have a separate control channel to "port hop"[1], to make it harder to
> scan for them, both SMTP port and control port port-hop). they then
> speak normal SMTP to control whereto the spam is forwarded.

Hmm, I did portscan a few of these, and I did not see any obvious
servers. And if _I_ were a spammer, I'd choose a polling client
solution, it would get me through a lot of firewalls/nat routers.
--
(espen)

Ingvar Mattsson

Jan 20, 2004, 9:10:49 AM
Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:

As I understand it, they shift port about every 2-3 seconds, so if the
port scan took longer to go through ports 1024-65535, you may well
have missed it. There are some that poll, though (connect to one or more
web pages, grab instructions for where to grab "spam" and "address
list", download, send mail off).

//Ingvar

Paul Wallich

Jan 20, 2004, 10:03:32 AM
Tim Bradshaw wrote:

> * Ingvar Mattsson wrote:
>
>
>>FWIW, I used to "not have a spam problem" (that is, the ratio of spam
>>to legitimate email was low enough that the spam was not a
>>distraction). In the last 3-4 months, it's gone to spam being the
>>*major* part of my mailbox. Addresses ending up there have been in
>>existence (and used on Usenet) from 1993, 1996 and somewhere around
>>1999 (and one only since last year).
>
>
> Me too. I've struggled with a Bayesian thing but it really doesn't
> cope: I think it is failing because there isn't enough good mail to
> train it on (I get hardly any `real' mail: probably 1% of my mail is
> real), so it essentially classifies everything as spam, which I then
> have to wade through. I may try feeding it my whole mailbox as good
> to give it some more data to learn from, but that's only a couple of
> days worth of spam now, so I'm not sure if it will work.

Ultimately Bayesian filters seem to be a dead end because spammers have
them too. And the simple versions at least can be fooled by something
that contains a lot of probably-not-spam tokens along with the
most-probably-spam tokens. The next advance will probably be some kind
of learning algorithms that can recognize certain tokens, patterns or
meta-patterns (such as html faux codes or comments in the middle of
words) as definitive spam indicators.

paul

Raymond Wiker

Jan 20, 2004, 10:23:23 AM
Paul Wallich <p...@panix.com> writes:

> Ultimately Bayesian filters seem to be a dead end because spammers
> have them too. And the simple versions at least can be fooled by
> something that contains a lot of probably-not-spam tokens along with
> the most-probably-spam tokens. The next advance will probably be some
> kind of learning algorithms that can recognize certain tokens,
> patterns or meta-patterns (such as html faux codes or comments in the
> middle of words) as definitive spam indicators.

SpamAssassin uses a number of filters, and adds up the scores
for these filters. Tuning and configuration of SpamAssassin is not
straightforward; I wonder if it would be possible to apply genetic
algorithms to find good combinations of filters? If so, then this
might be a near-perfect application for Common Lisp...

--
Raymond Wiker Mail: Raymon...@fast.no
Senior Software Engineer Web: http://www.fast.no/
Fast Search & Transfer ASA Phone: +47 23 01 11 60
P.O. Box 1677 Vika Fax: +47 35 54 87 99
NO-0120 Oslo, NORWAY Mob: +47 48 01 11 60

Try FAST Search: http://alltheweb.com/

Tim Bradshaw

Jan 20, 2004, 10:53:54 AM
* Raymond Wiker wrote:
> SpamAssassin uses a number of filters, and add up the scores
> for these filters. Tuning and configuration of SpamAssassin is not
> straightforward; I wonder if it would be possible to apply genetic
> algorithms to find good combinations of filters? If so, then this
> might be a near-perfect application for Common Lisp...

I think the problem is that something has to score the results, and
that requires a human to look at the spam and nonspam, and that
defeats the object. I don't get enough spam that I can't filter it
fairly easily by eye, but I do get enough (and a high enough ratio of
spam/nonspam) that I'm almost no longer willing to.

--tim

Raymond Wiker

unread,
Jan 20, 2004, 11:23:31 AM1/20/04
to
Tim Bradshaw <t...@cley.com> writes:

It's possible that I've been confusing genetic algorithms with
neural networks... Anyway, I was thinking of training the resulting
system in much the same way as is done with Bayesian networks.

Pascal Costanza

unread,
Jan 20, 2004, 11:33:52 AM1/20/04
to

I think that Apple Mail and Mozilla/Thunderbird have managed to make
this a relatively unobtrusive process. In my case, it took about 4 weeks
of training Apple Mail, and then it was pretty reliable in
distinguishing spam from other mail.

The trick is to make the classification as simple as clicking the delete
button, so that it fits with your regular workflow. I wouldn't have
accepted a distinct classification step.

On the other hand, I receive quite a high number of spam mails, so this
maybe made the process more effective.


Pascal

--
Tyler: "How's that working out for you?"
Jack: "Great."
Tyler: "Keep it up, then."

Ingvar Mattsson

unread,
Jan 20, 2004, 11:47:14 AM1/20/04
to
Pascal Costanza <cost...@web.de> writes:

> Tim Bradshaw wrote:
> > * Raymond Wiker wrote:
> >
> >> SpamAssassin uses a number of filters, and add up the scores
> >>for these filters. Tuning and configuration of SpamAssassin is not
> >>straightforward; I wonder if it would be possible to apply genetic
> >>algorithms to find good combinations of filters? If so, then this
> >>might be a near-perfect application for Common Lisp...
> > I think the problem is that something has to score the results, and
> > that requires a human to look at the spam and nonspam, and that
> > defeats the object. I don't get enough spam that I can't filter it
> > fairly easily by eye, but I do get enough (and a high enough ratio of
> > spam/nonspam) that I'm almost no longer willing to.
>
> I think that Apple Mail and Mozilla/Thunderbird have managed to make
> this a relatively unobtrusive process. In my case, it took about 4
> weeks of training Apple Mail, and then it was pretty reliable in
> distinguishing spam from other mail.
>
> The trick is to make the classification as simple as clicking the
> delete button, so that it fits with your regular workflow. I wouldn't
> have accepted a distinct classification step.

What I did for a while before "trusting" the classification by the
Bayesian filter was to (a) classify saved non-spam mailboxes as "not
spam", then (b) move everything that was spam to a "spam" folder and
after a few days classify that as "spam". Slightly more than "just
click a button", but the extra work is the typing of two commands
(sa-learn --spam --dir ../spam && rm ../spam/*).

I could, I guess, have added a "Spam" and a "Ham" button to my mail
client and done it that way, but after having spent a whole 2-3
minutes, I decided that the amortised effort probably wasn't worth
it. As usual, your mileage may vary.

//Ingvar
--
(defmacro fakelambda (args &body body) `(labels ((me ,args ,@body)) #'me))
(funcall (fakelambda (a b) (if (zerop (length a)) b (format nil "~a~a"
(aref a 0) (me b (subseq a 1))))) "Js nte iphce" "utaohrls akr")

Tage Stabell-Kulø

unread,
Jan 20, 2004, 2:43:56 PM1/20/04
to
Tim Bradshaw <t...@cley.com> writes:

> I think the problem is that something has to score the results, and
> that requires a human to look at the spam and nonspam, and that
> defeats the object.

Before I returned to reading email in Emacs, I used the Cloudmark
service embedded in Outlook. Basically, it lets you mark email as
spam and send an MD5 hash of the email to them. All mail marked by
others as spam is moved to the spam folder.

Or, in other words, only one single user will (have to see) each
individual spam. On a normal day it would remove about 95% of all my
spam, and I'd do the remaining 5% as my part of the effort. I receive
100+ spam per day. I would go through my spam box every week before
deleting. Not even once during my years as a user did I find that
Cloudmark had been mistaken!

Cloudmark works on the assumption that the body of spam is identical
for all users. As long as it is, the approach will work if you have a
large enough user community.

According to their homepage, sendmail.com has chosen their solution
as part of its commercial offering. Unfortunately, there is nothing to
be bought from sendmail.com for NetBSD, so that's a no-go for me.


> --tim

[TaSK@/\\]

Thomas F. Burdick

unread,
Jan 20, 2004, 3:28:37 PM1/20/04
to
Tim Bradshaw <t...@cley.com> writes:

> * Ingvar Mattsson wrote:
>
> > FWIW, I used to "not have a spam problem" (that is, the ratio of spam
> > to legitimate email was low enough that the spam was not a
> > distraction). In the last 3-4 months, it's gone to spam being the
> > *major* part of my mailbox. Addresses ending up there have been in
> > existence (and used on Usenet) from 1993, 1996 and somewhere around
> > 1999 (and one only since last year).
>
> Me too. I've struggled with a Bayesian thing but it really doesn't
> cope: I think it is failing because there isn't enough good mail to
> train it on (I get hardly any `real' mail: probably 1% of my mail is
> real), so it essentially classifies everything as spam, which I then
> have to wade through. I may try feeding it my whole mailbox as good
> to give it some more data to learn from, but that's only a couple of
> days worth of spam now, so I'm not sure if it will work.

My problem is that it's pretty easy to get SpamAssassin to recognize
spam, but even with years of legit email to work with, I can't get a
setup that doesn't mark large amounts of my legit email as spam. I
was able to get SA to mark only about 2% of my legit email as spam,
and let through about 50% of the spam I get, but it involved filtering
for viruses and known-good senders first. So, I was able to get
things setup so I only wade through the amount of spam I did a year
ago.

I'm guessing that the writers of spam filters mostly correspond with
people who write in more-or-less complete sentences. But I don't want
to stop writing to my friends who sometimes send me emails like Yo
son, CHECK THIS SHIT OUT $$$ <url> fuk dat track is HOT!!

Thomas A. Russ

unread,
Jan 20, 2004, 2:52:24 PM1/20/04
to

There are still spammers who send mail to message-IDs dating back to
1994, and the SGML mailboxes continue to receive lots and lots of junk
despite being discontinued in 1996. It is about one third of what my
1987-vintage University of Oslo address receives, but still, I'm quite
impressed with the persistence of these drooling idiots.

Actually, this doesn't seem all that surprising, given some of the
economics of the spam industry. The upstream suppliers like to sell
EMail address lists with the largest number of addresses in them, so
they really have absolutely NO incentive to try to cull addresses once
they have been added. As far as I know, there aren't any rating groups
out there who analyze the degree of usefulness of those address lists :)


--
Thomas A. Russ, USC/Information Sciences Institute

Petter Gustad

unread,
Jan 20, 2004, 4:58:38 PM1/20/04
to
Tim Bradshaw <t...@cley.com> writes:

> Me too. I've struggled with a Bayesian thing but it really doesn't
> cope: I think it is failing because there isn't enough good mail to
> train it on (I get hardly any `real' mail: probably 1% of my mail is

Maybe you could use comp.lang.lisp to train it?

I'm using mew (an emacs mh mail client) and it has a very handy
command to train spam (ls) and ham (lh) in conjunction with bogofilter
(Bayesian). However, the problem lately is the high ratio of Bayesian
poison that I get (which I just file in my spam folder without any
further training).

Are there any good filters for detecting the Bayesian-poison nonsense?

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Erann Gat

unread,
Jan 20, 2004, 5:33:24 PM1/20/04
to
In article <87ektus...@filestore.home.gustad.com>, Petter Gustad
<newsma...@gustad.com> wrote:

> Are there any good filters for detecting the Bayesian-poison nonsense?
>

I haven't actually tried this, but all the Bayes-poison I've gotten is
just random words, so I would think it would succumb easily to a histogram
test. I don't think you'd have to get very sophisticated. Just looking
at the ratio of short words (<= 3 characters) to the total number of words
seems like it ought to work. I predict this ratio would be around 0.1 for
non-spam, and <0.01 for Bayes-poison spam.

E.
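The proposed histogram test is nearly a one-liner; here is a minimal Python sketch. Note that the 0.1 and 0.01 cutoffs above are the poster's predictions, not measured values.

```python
def short_word_ratio(text):
    """Ratio of short words (3 characters or fewer) to total words."""
    words = text.split()
    if not words:
        return 0.0
    short = sum(1 for w in words if len(w) <= 3)
    return short / len(words)
```

Ordinary English is full of articles, prepositions, and pronouns, so its ratio is high; random-dictionary-word poison has almost none.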

Tim Bradshaw

unread,
Jan 20, 2004, 6:34:30 PM1/20/04
to
* Thomas F Burdick wrote:

> My problem is that it's pretty easy to get SpamAssassin to recognize
> spam, but even with years of legit email to work with, I can't get a
> setup that doesn't mark large amounts of my legit email as spam.

Right, that's my problem exactly. In fact it's not really surprising,
since well over 99% of my mail *is* spam (or was in week 3 of 2004,
anyway), so it essentially doesn't see any good mail at all. I think
that this is what will doom bayesian things (and possibly any
statistical approach): statistically *all* mail will be spam quite
soon (and it is already for some people) so any training data they get
will be hopelessly skewed. Although someone with a better grasp of
statistics than me can probably demonstrate that this is OK in fact.

--tim

Tim Lavoie

unread,
Jan 20, 2004, 6:56:09 PM1/20/04
to
>>>>> "Erann" == Erann Gat <gNOS...@jpl.nasa.gov> writes:

Erann> In article <87ektus...@filestore.home.gustad.com>,
Erann> Petter Gustad
Erann> <newsma...@gustad.com> wrote:

>> Are there any good filters for detecting the Bayesian-poison
>> nonsense?
>>

Erann> I haven't actually tried this, but all the Bayes-poison
Erann> I've gotten is just random words, so I would think it would
Erann> succumb easily to a histogram test. I don't think you'd
Erann> have to get very sophisticated. Just looking at the ratio
Erann> of short words (<= 3 characters) to the total number of
Erann> words seems like it ought to work. I predict this ratio
Erann> would be around 0.1 for non-spam, and <0.01 for
Erann> Bayes-poison spam.

Better yet, just use the filing of good or bad as further training
data. I get the occasional escapee with poisoning attempts, but it
just gets sent to the spam pile, and trained as spam for next time.

Even with poisoning, I get relatively little delivered to my main
mailbox, so the filtering is doing well. (spambayes, if it matters)

Thomas F. Burdick

unread,
Jan 20, 2004, 7:12:38 PM1/20/04
to
Tim Lavoie <tool...@spamcop.net> writes:

> Better yet, just use the filing of good or bad as further training
> data. I get the occasional escapee with poisoning attempts, but it
> just gets sent to the spam pile, and trained as spam for next time.
>
> Even with poisoning, I get relatively little delivered to my main
> mailbox, so the filtering is doing well. (spambayes, if it matters)

The problem is, you're training it to *recognize* the poison, but not
necessarily to distinguish it from less-than-coherent legit email.

Harald Hanche-Olsen

unread,
Jan 20, 2004, 6:40:48 PM1/20/04
to
+ Tim Bradshaw <t...@cley.com>:

| Me too. I've struggled with a Bayesian thing but it really doesn't
| cope: I think it is failing because there isn't enough good mail to
| train it on (I get hardly any `real' mail: probably 1% of my mail is
| real), so it essentially classifies everything as spam, which I then
| have to wade through. I may try feeding it my whole mailbox as good
| to give it some more data to learn from, but that's only a couple of
| days worth of spam now, so I'm not sure if it will work.

I have good experiences with spambayes. When I first set it up, I
trained it on about 1000 hams and several thousand spams that I had
squirreled away for just such an opportunity. Then I tested it on all
the mail I had just trained it on, and discovered about a dozen
messages misclassified in each category. Now, my daily haul is
roughly 300 spam, 10 ham, and 3 unsure. I hand classify the unsure
ones and train on them, otherwise I do no more training. In eight
months operation I had very few hams misclassified as spam. But then
I am perhaps lucky, living in the academic world. Those who live in
the business world and need to distinguish between sales pitches that
are not spam and those that are, may have less luck.

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- Debating gives most of us much more psychological satisfaction
than thinking does: but it deprives us of whatever chance there is
of getting closer to the truth. -- C.P. Snow

Tim Bradshaw

unread,
Jan 20, 2004, 6:57:28 PM1/20/04
to
* Erann Gat wrote:

> I haven't actually tried this, but all the Bayes-poison I've gotten is
> just random words, so I would think it would succumb easily to a histogram
> test. I don't think you'd have to get very sophisticated. Just looking
> at the ratio of short words (<= 3 characters) to the total number of words
> seems like it ought to work. I predict this ratio would be around 0.1 for
> non-spam, and <0.01 for Bayes-poison spam.

I presume that what will happen is an arms race here: the obvious
approach to dealing with this kind of random-word stuff is to look at
bigram (or in general n-gram) statistics. I'm fairly sure that the
bigram stats of the kind of bayesian-poison stuff I see are nothing
like English. So something that does bigrams should easily be able to
distinguish. But unfortunately it's just as easy to write things
which *generate* text using bigram statistics from a bunch of input
data, and of course that output will look exactly like English to a
bigram-stats filter. Anything longer than bigrams starts needing lots
of training data I think.

(Or indeed: Input data I think. Gram statistics; Just as easy to a
bigram stats filter. The bigram stats filter. Obvious approach to a
bigram statistics from a bigram statistics from a bigram statistics;
I'm fairly sure that does bigrams starts needing lots of the kind of
bayesian poison stuff I see is nothing like English.)

--tim
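Both sides of this arms race do fit in a few lines. A minimal Python sketch of a word-bigram table and a generator that samples from it (roughly how the deliberately garbled closing paragraph above reads), assuming plain whitespace tokenization:

```python
import random
from collections import defaultdict


def bigram_table(corpus):
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table


def generate(table, start, n, rng):
    """Emit up to n words by repeatedly sampling a successor of the
    current word, so every adjacent pair occurs in the training corpus."""
    out = [start]
    while len(out) < n:
        successors = table.get(out[-1])
        if not successors:  # dead end: word was never followed by anything
            break
        out.append(rng.choice(successors))
    return " ".join(out)
```

The same table run as a scorer (what fraction of a message's bigrams were ever seen in training?) gives the detector, which is exactly why the race is symmetric: any n-gram test can be fooled by an n-gram generator.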


Pascal Bourguignon

unread,
Jan 21, 2004, 12:44:15 AM1/21/04
to
Tim Bradshaw <t...@cley.com> writes:

I've done a test with bigrams and trigrams: the gain to bigram was
sizeable, but from bigrams to trigrams, there was not much difference
on the data set I used (with mailboxes from 1000 to 4000 spam or good
messages).

--
__Pascal_Bourguignon__ http://www.informatimago.com/
There is no worse tyranny than to force a man to pay for what he doesn't
want merely because you think it would be good for him.--Robert Heinlein
http://www.theadvocates.org/

Erik Naggum

unread,
Jan 21, 2004, 1:42:36 AM1/21/04
to
* Tim Bradshaw

| (Or indeed: Input data I think. Gram statistics; Just as easy to a
| bigram stats filter. The bigram stats filter. Obvious approach to a
| bigram statistics from a bigram statistics from a bigram statistics;
| I'm fairly sure that does bigrams starts needing lots of the kind of
| bayesian poison stuff I see is nothing like English.)

The natural extension of this method is to use the spell checker on
incoming mail.

--
Erik Naggum | Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.

Matthias

unread,
Jan 21, 2004, 4:25:50 AM1/21/04
to
Tim Bradshaw <t...@cley.com> writes:

The concept you might be looking for is "extreme value statistics"
which is the statistics of rare events. It is used by insurance
companies to model events like earthquakes and hurricanes. Extreme
value statistics is often hard to do, because you need to model the
tails of distributions accurately. I'm no expert here, so I can't
say more. (But Google has lots of references.)

So far statistical methods seem relatively successful approaches to
model human language (compared to the alternatives). The problem with
Bayesian filters is not that they are using Bayes' rule but that,
currently, their model of human language is crude (individual words
are assumed statistically independent).

BTW: For me, the current language model works fine: My post-box
obtains approx 1000 emails a month, 2/3 of which are spam. Bogofilter
does a very nice job with filtering (much better than spamassassin's
set of rules). OTOH, I don't trust it enough to put my e-mail address
on usenet. ;)
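The crude model described here, every word scored independently given the class, is a naive-Bayes classifier, and can be sketched minimally in Python. This is illustrative only; real filters like bogofilter add smarter tokenization and score combination.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Minimal naive-Bayes text classifier: words are treated as
    statistically independent given the class, which is exactly the
    crude language model criticized above."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, label, text):
        words = text.lower().split()
        self.counts[label].update(words)
        self.totals[label] += len(words)

    def score(self, text):
        """Log-odds of spam vs ham; positive means 'looks like spam'."""
        logodds = 0.0
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out a product.
            p_spam = (self.counts["spam"][w] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][w] + 1) / (self.totals["ham"] + 2)
            logodds += math.log(p_spam / p_ham)
        return logodds
```

The independence assumption is visible in the inner loop: each word contributes its log-odds on its own, with no account of its neighbours.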

Ingvar Mattsson

unread,
Jan 21, 2004, 5:32:50 AM1/21/04
to
Petter Gustad <newsma...@gustad.com> writes:

> Tim Bradshaw <t...@cley.com> writes:
>
> > Me too. I've struggled with a Bayesian thing but it really doesn't
> > cope: I think it is failing because there isn't enough good mail to
> > train it on (I get hardly any `real' mail: probably 1% of my mail is
>
> Maybe you could use comp.lang.lisp to train it?
>
> I'm using mew (an emacs mh mail client) and it has a very handy
> command to train spam (ls) and ham (lh) in conjunction with bogofilter
> (Bayesian). However, the problem lately is the high ratio of Bayesian
> poison that I get (which I just file in my spam folder without any
> further training).
>
> Are there any good filters for detecting the Bayesian-poison nonsense?

Grave mismatch between the text/plain and the text/html non-marked-up
text? That takes care of stuff with "You have received a mail in HTML"
too (if people can't write plain text emails, I don't want them).

//Ingvar
--
(defun p(i d)(cond((not i)(terpri))((car i)(let((l(cadr i))(d(nthcdr(car i)d
)))(princ(elt(string(car d))l))(p(cddr i)d)))(t(princ #\space)(p(cdr i)d))))
(p'(76 2 1 3 1 4 1 6()0 5()16 10 0 7 0 8 0 9()2 6 0 0 12 4 23 4 1 4 8 8)(sort
(loop for x being the external-symbols in :cl collect (string x)) #'string<))

Tim Bradshaw

unread,
Jan 21, 2004, 5:48:36 AM1/21/04
to
* Pascal Bourguignon wrote:
> I've done a test with bigrams and trigrams: the gain to bigram was
> sizeable, but from bigrams to trigrams, there was not much difference
> on the data set I used (with mailboxes from 1000 to 4000 spam or good
> messages).

This is very likely because you don't have enough data to get
reasonable trigram stats: you really need a lot.

--tim

Tim Bradshaw

unread,
Jan 21, 2004, 5:59:57 AM1/21/04
to
* Erik Naggum wrote:

> The natural extension of this method is to use the spell checker on
> incoming mail.

I think I may be confused by what people mean by `Bayesian poison'.
Is it all the stuff which has non-words in it, like v1agra &c? In
that case I think that something which had a decent sized dictionary
and looked for more than n% known words would probably be good.

The stuff I was worrying about was things which have lots of random,
English, words in, possibly with correct single-word stats, so
anything that uses a completely naive single-word model of language
will assume it's English, but the bigram stats will be completely
mutant. A spelling checker doesn't help with this, I think. I also
think an ngram thing can be fooled because it's so easy to generate.

Another approach might be to use a PoS tagger, and then look at the
nouns to see what *they* look like (or maybe other parts of speech).

But even here, I don't know. I have seen (or possibly dreamed) stuff
which has great chunks of out-of-copyright novels in it, but is
constructed such that if you look at it with whatever tool it's aimed
at, you don't see that (sometimes because it's in comments, but
sometimes because it's white text on a white background). So a tool
needs to be able to work out what IE would do with it, which is a
hideous problem. (Actually, it's an easy problem: if it's not plain
text, it's spam. So spam has essentially destroyed any kind of rich
content in email.)

--tim

Espen Vestre

unread,
Jan 21, 2004, 6:46:08 AM1/21/04
to
Tim Bradshaw <t...@cley.com> writes:

> I think I may be confused by what people mean by `Bayesian poison'.
> Is it all the stuff which has non-words in it, like v1agra &c?

Hmm? I thought it was all the stuff that had real words in it?

Here's a cut and paste from a recently received spam:

"album howe vindicate arabic deafen decay twiddle hierarchy smalley
backlash luminous midwestern bivariate abstracter kinshasha negotiable
lawbreaker centerline backspace cranky struck consumption"

--
(espen)

Christophe Rhodes

unread,
Jan 21, 2004, 7:04:44 AM1/21/04
to
Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:

And my impression was that this isn't designed to defeat Bayesian spam
detectors, because the above words in any individual's corpus of ham
will feature very rarely if at all.

A reasonable assumption might be that one or two of these words have
appeared in a small number of ham messages. They will therefore
contribute a small amount to the haminess of the message. However,
the payload will generally have many strong indications of spam, and
consequently, at least here, the message is still filed as spam.

This filing as spam, in addition, retrains the Bayesian filter, such
that e.g. "hierarchy" becomes a slightly less strong indication of
spam. So be it. If "hierarchy" is a commonly-used word in your ham
(as it has been for me, as I did some work on the cosmological
hierarchy problem -- which incidentally means that "heirarchy" is also
a ham word for me... :-) then this slight retraining has no effect on
truly ham messages, which are still classified as ham; nor does it
have any effect on the next set of random words in spam, which is
unlikely to use "hierarchy" again. So all in all, these random words
don't seem designed to defeat adaptive Bayesian filters.

What they do defeat, of course, is a SpamAssassin rule such as one
matching a MIME message with no text/plain component, without
triggering a rule detecting obvious spam words. If I were attempting
to defeat Bayesian filters, more common words would seem more likely
to act as poison.

Christophe
--
http://www-jcsu.jesus.cam.ac.uk/~csr21/ +44 1223 510 299/+44 7729 383 757
(set-pprint-dispatch 'number (lambda (s o) (declare (special b)) (format s b)))
(defvar b "~&Just another Lisp hacker~%") (pprint #36rJesusCollegeCambridge)

Pascal Bourguignon

unread,
Jan 21, 2004, 7:09:43 AM1/21/04
to
Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:

Strangely enough, none of these words appear in my normal email.
Perhaps if I was a nuclear physicist, "decay" would appear sometimes...

Hannah Schroeter

unread,
Jan 21, 2004, 8:30:17 AM1/21/04
to
Hello!

Espen Vestre <espen@*do-not-spam-me*.vestre.net> wrote:
>[...]

>"album howe vindicate arabic deafen decay twiddle hierarchy smalley
>backlash luminous midwestern bivariate abstracter kinshasha negotiable
>lawbreaker centerline backspace cranky struck consumption"

In fact, on my filter database, those words would end up as
spamwords, too, as they are quite randomly selected while my
hamwords are of course adapted to the mail traffic *I* individually
receive (a great part German words, but in both English and German,
biased through my interests).

And even if every piece of spam contains different such words,
they are at least not strongly "ham", and don't influence the
final decision, as that is drawn from the most significant
words.

Kind regards,

Hannah.

Joe Marshall

unread,
Jan 21, 2004, 10:47:34 AM1/21/04
to
Tim Bradshaw <t...@cley.com> writes:

> I presume that what will happen is an arms race here: the obvious
> approach to dealing with this kind of random-word stuff is to look at
> bigram (or in general n-gram) statistics.

It depends on the hack they are using.

Some spam I get has random character strings in it like this:

Bad Credit is OK Gold Visa Card rjkarcmrosfn f

others have pseudowords:

astrologum Palatioque big chance

some have random words:

Authors, you decide ........picosecond
why wait? ................... dakar

some seem to take a sample of canned text:

Rolex-Italian crafted from $65-$65-$140 Free
Ship<!--extravagance. And when did you arrive here? inquired
she. -->

and some seem to sample text from other sources:

Bechtel nod to L&T raises Indian hopes on Iraq projects


I have a regular expression that matches non-English trigrams. It
finds the random character strings without a problem. Pseudowords
tend to have non-English trigrams in them (for instance the trigram
"ioq" is not common in English). The other stuff gets more difficult.

The random words should be relatively easy statistically. While I
might actually talk about picoseconds, the bulk of the words in the
English language are not ones I commonly use. (How often do you write
about hippodromes, speakeasies, and peristalses?)
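The trigram idea can be sketched in Python. The blacklist below is a tiny hand-picked sample purely for illustration; a real filter would derive its list of rare trigrams from English letter-frequency data.

```python
# Tiny illustrative blacklist of character trigrams that are rare in
# English; a real filter would derive such a list from letter-frequency
# data instead of hand-picking it.
RARE_TRIGRAMS = ("ioq", "rjk", "zxq", "mrc", "fnf")


def has_rare_trigram(word):
    """True if any blacklisted trigram appears in the word."""
    w = word.lower()
    return any(t in w for t in RARE_TRIGRAMS)
```

This catches both the random character strings ("rjkarcmrosfn") and many pseudowords ("Palatioque"), while leaving real English words alone.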

Joe Marshall

unread,
Jan 21, 2004, 11:00:08 AM1/21/04
to
Tim Bradshaw <t...@cley.com> writes:

> Me too. I've struggled with a Bayesian thing but it really doesn't
> cope: I think it is failing because there isn't enough good mail to
> train it on (I get hardly any `real' mail: probably 1% of my mail is

> real), so it essentially classifies everything as spam, which I then
> have to wade through. I may try feeding it my whole mailbox as good
> to give it some more data to learn from, but that's only a couple of
> days worth of spam now, so I'm not sure if it will work.

A pre-filter before the Bayesian filter is helpful. Bayesian filters
tend to `latch on' to things that are easy to find statistically to
the detriment of those things that are more difficult. There is no
need to train the filter to remove email from the .BIZ domain, or
email that claims to be a reply, but doesn't have a `References'
header.

In order to get the best results you need to have a model of spam and
ham. It would be worthwhile to determine how the various popular spam
engines work so as to detect the engine itself rather than the message
within the spam. For instance, one spam engine generates random
pseudohtml: <oeaun><snauthsbm>

Another inserts text in the middle of words: V<!-- oblong -->ia<!--
interest -->gra

You don't need to statistically match these, just detect them.
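Detecting the comment-in-the-middle-of-a-word trick is indeed plain pattern matching rather than statistics; a Python sketch:

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(body):
    """Remove HTML comments, so 'V<!-- x -->iagra' collapses to 'Viagra'."""
    return COMMENT.sub("", body)


def comment_split_words(body):
    """True if an HTML comment sits in the middle of a word, i.e. word
    characters directly abut the comment on both sides."""
    return re.search(r"\w<!--.*?-->\w", body, re.DOTALL) is not None
```

A message that trips `comment_split_words` can be rejected outright, or its comment-stripped text can be handed to the statistical filter so the obfuscation buys the spammer nothing.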

Ray Dillinger

unread,
Jan 21, 2004, 12:33:48 PM1/21/04
to
Matthias wrote:
>
> statistical methods seem relatively successful approaches to
> model human language (compared to the alternatives). The problem with
> Bayesian filters is not that they are using Bayes' rule but that,
> currently, their model of human language is crude (individual words
> are assumed statistically independent).
>
> BTW: For me, the current language model works fine: My post-box
> obtains approx 1000 emails a month, 2/3 of which are spam. Bogofilter
> does a very nice job with filtering (much better than spamassassin's
> set of rules). OTH, I don't trust it enough to put my e-mail address
> on usenet. ;)

These days I run a few simple checks on my mail -- basically
whitelists for known senders -- and then hand it over to ASSP's
bayesian filter.

ASSP, and several other new-generation filters, build a spamdb
of word pairs rather than individual words. It makes the database
bigger, but hard drive space is cheap.

Some links:

http://spamprobe.sourceforge.net/
http://assp.sourceforge.net/

The only thing that some people might dislike about it is that if
someone uses a free service that attaches a lot of spam to every
message that comes through (like Yahoo) it will treat their messages
as spam. I am happy with this behavior and do not correct it.

I use ASSP. I've had the same email address for ten years. I post
to usenet using it. 'nuff said?

Bear

Petter Gustad

unread,
Jan 21, 2004, 1:25:58 PM1/21/04
to
Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:

This was my assumption too, even though I have to admit that I've never
seen a definition of the term.

I thought the spammers would put in a spam word within a large number
of ham words in order to cause auto-learn filters to decrease the spam
score for the given spam word. Is my assumption wrong?

Tim Bradshaw

unread,
Jan 21, 2004, 1:47:16 PM1/21/04
to
* Joe Marshall wrote:

> In order to get the best results you need to have a model of spam and
> ham. It would be worthwhile to determine how the various popular spam
> engines work so as to detect the engine itself rather than the message
> within the spam. For instance, one spam engine generates random
> pseudohtml: <oeaun><snauthsbm>

Yes, the problem is that you need something that just works, and keeps
working - you can spend time implementing stuff and chasing upgrades
&c &c, but if you do that, then the spammers have won, because they've
made email really expensive in terms of time. Probably the solution
(for me, anyway) is just to outsource the whole thing to someone who
will do the detection &c, and can be good, because they can afford to
spend their lives on it. I guess there are services that can do this.

--tim

Tim Lavoie

unread,
Jan 21, 2004, 3:40:16 PM1/21/04
to
>>>>> "Thomas" == Thomas F Burdick <t...@famine.OCF.Berkeley.EDU> writes:

Thomas> The problem is, you're training it to *recognize* the
Thomas> poison, but not necessarily to distinguish it from
Thomas> less-than-coherent legit email.

Perhaps, but it does work well. Maybe I don't attract enough
incoherent (yet legit) email to probe that case. Since no system is
100%-effective, I still drop the alleged spam into its own
mailbox. Once I skim it for errors, it all gets used for Spamcop
reporting and filter training.

In any case, how much incoherent email do I really want? :)

Tim

Klaus Harbo

unread,
Jan 21, 2004, 4:17:16 PM1/21/04
to

Spam is one of the most annoying features of today's Internet.

I have been a very happy user of TMDA (Tagged Message Delivery Agent)
for almost a year now. TMDA has effectively reduced the amount of
spam I receive from many hundreds per week (my personal high was 300 a
day over the course of more than two weeks) to essentially zero. TMDA
relies on white lists and automated responses to mails from unknown
senders. Check out www.tmda.net. Written in Python, btw.

-Klaus.

Joe Marshall

unread,
Jan 21, 2004, 6:58:57 PM1/21/04
to
Tim Bradshaw <t...@cley.com> writes:

> Yes, the problem is that you need something that just works, and keeps
> working - you can spend time implementing stuff and chasing upgrades
> &c &c, but if you do that, then the spammers have won, because they've
> made email really expensive in terms of time.

I don't think you can characterize this as the spammers having `won'.
They `win' when they get even one response from the multiple millions
of email they send. The people that have to clean up after them are
of no concern to them one way or the other.

--
~jrm

Erik Naggum

unread,
Jan 21, 2004, 9:13:08 PM1/21/04
to
* Klaus Harbo

| TMDA has effectively reduced the amount of spam I receive from many
| hundreds per week (my personal high was 300 a day over the course
| of more than two weeks) to essentially zero. TMDA relies on white
| lists and automated responses to mails from unknown senders.

While this probably works to keep unwanted initial contacts out of
your mailbox, e-mail works so well because it is a very convenient
initial contact medium. The threshold to making the first contact
with a relative stranger is already high, and an automated answer that
requires manual intervention is worse than calling people you don't
know and only getting their answering machine telling you that they
screen their calls. I hate answering machines and don't use them,
and I tend to hang up and try again later if I get one. (For this
reason, I really love SMS, which I also send before I call people
to tell them what I'm going to call about and when. This habit was
picked up by a number of lawyers I know, who say it saves them and
their clients a lot of time.) I have to admit that I have received
TMDA-style responses from people I have tried to communicate with
and just found that it was not worth it. This is a false negative
that I think argues very strongly against TMDA-style solutions. If
something like this is necessary, it should be handled by the mail
clients automatically and not involve human time at all. I have
not seen any attempt to do this, but would love to hear about it.

Joe Marshall

Jan 21, 2004, 10:59:32 PM
to
Erik Naggum <er...@naggum.no> writes:

> (For this
> reason, I really love SMS, which I also send before I call people
> to tell them what I'm going to call about and when.


I'm not familiar with SMS. What is it?


--
~jrm

Georges Ko

Jan 21, 2004, 11:57:53 PM
to
Klaus Harbo <kl...@harbo.net> writes:

I use white lists as well, but I leave the door open by scanning
the subject for some keyword so that legitimate mail sent by unlisted
people doesn't end up in "nnml:mail.misc" among the 200-300 daily
spams, which I manually check and clean every 3-4 days so that
legitimate mail without the keyword still has some chance.

So, when I give my email address to someone, I either ask for his
email address (to be added to the white list) or tell him to add the
keyword in the subject of his mail (keyword included in signature,
written next to the email address on business cards, etc...).

Maybe mail client implementors could agree to provide an
additional field, called for example "Key", among "To", "Subject",
etc. in their programs (integrated with address books, etc..) that
would work this way. People would publish their address and key and it
would be the email address harvesters' work to map the addresses to
the keys, which would be nearly unworkable, because the key could be
provided by saying "My key is my last name in lower case".
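The whitelist-plus-keyword scheme described above is simple enough to sketch in a few lines of Common Lisp. This is only an illustration of the idea, not anyone's actual filter; the variable names and addresses are invented:

```lisp
;; Sketch of a whitelist-plus-subject-keyword filter, as described above.
;; *WHITELIST* and *KEY* are invented placeholders for illustration.
(defparameter *whitelist* '("friend@example.org" "colleague@example.org"))
(defparameter *key* "[m2gko]")

(defun accept-mail-p (from subject)
  "Accept mail when FROM is on the whitelist, or SUBJECT carries the key."
  (or (member from *whitelist* :test #'string-equal)
      (search *key* subject :test #'char-equal)))
```

An unknown sender who includes the key still gets through: `(accept-mail-p "stranger@example.net" "Hello [m2gko]")` returns a non-NIL value, while the same sender without the key is rejected.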
--
Georges Ko g...@gko.net 2004-01-22
If you are not in my white list, add [m2gko] in the subject of your mail.

Erik Naggum

Jan 22, 2004, 12:14:18 AM
to
* Joe Marshall

| I'm not familiar with SMS. What is it?

Short Message Service. GSM cell phones are equipped with the ability
to send and receive text messages, like two-way text pagers, which
were never popular in countries where GSM is. Originally designed to
send GSM phones configuration information and even software updates
from the operator, SMS messages come in many types, but only innocuous
text messages (at least as far as modifying the phone's configuration
is concerned) are sendable from GSM phones, and it was never intended
to be an integral part of the GSM offering. (Other types are sent by
operators and include ringing tones, images to fit the small display
on the phone, etc, to customize phones, which is surprisingly popular
and profitable.) GSM phones can send a request to a provider and pay
for the returned object. This is even used for directory assistance
and a host of other and very useful services, too numerous to mention.
To give you a hint of the popularity of this service, the largest GSM
provider in Norway, which serves about 2 million customers, served
more than 2 billion SMS messages in 2003. It is not uncommon for a
GSM phone user here to spend more money on SMS'es than on phone calls.
Many companies make a healthy living solely from the SMS market, and
we even have 3 TV stations that host night-time chat and music shows
where people send in SMS'es at a hefty cost (easily equal in cost to a
10-minute phone call) to vote for the music they want to hear and a
lot of other things. Serious information providers also charge for
broadcasting selected news items, such as financial market alerts.
Some newspapers and TV stations offer TV addicts alerts on upcoming TV
programs.

SMS'es are restricted to 160 characters in length and they are often
hard to type, requiring up to six presses in a row on a single key,
which has produced new languages that omit characters, and even the
use of dictionary algorithms that find matching words for the word
typed with only one keypress, such as the T9 algorithm, causing the
coinage of the word "teenineonyms", words that have the same sequence
of keys. A version of the Bible was published some time ago in the
peculiar compressed language of teenage SMS messages. Translation
dictionaries come with tables of common compressions: THX (thanks) to
MR6 (merci) or RUBZ (are you busy?) to TOQP (t'es occupé?) or ROFL
(rolling on the floor laughing) to MDR (mort de rire). You get the
idea -- our cultures have never seen such rapid change in the way we
express ourselves as the 160-character SMS message affords. All
Europeans are familiar with this phenomenon, and SMS'es are used by
absolutely every owner of cell phones. Some hospitals and doctors use
them to optimize their limited resources while avoiding long waits in
their waiting rooms and wasting time because of absenteeism.
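The T9 idea Erik describes is easy to demonstrate: each letter maps to the digit of the phone key that carries it, and two words are "teenineonyms" exactly when they map to the same digit string. A minimal Common Lisp sketch, assuming plain A-Z input:

```lisp
(defun t9-digits (word)
  "Map WORD onto standard phone-keypad digits (2=ABC, 3=DEF, ... 9=WXYZ)."
  (map 'string
       (lambda (ch)
         ;; The string below lists the key digit for each letter A..Z.
         (char "22233344455566677778889999"
               (- (char-code (char-upcase ch)) (char-code #\A))))
       word))
```

For example, `(t9-digits "kiss")` and `(t9-digits "lips")` both return `"5477"`, so a T9 dictionary cannot tell them apart from the keypresses alone.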

Think of it as portable e-mail available to the entire population.

Rob Warnock

Jan 22, 2004, 4:08:11 AM
to
Tage Stabell-Kulø <ta...@ifi.uit.no> wrote:
+---------------
| Cloudmark works on the assumption that the body of spam is identical
| for all users. As long as it is, the approach will work if you have a
| large enough user community.
+---------------

Unfortunately, that assumption stopped being true about 3-4 years ago.
Spammers started putting in per-message random junk, which blows Cloudmark,
DCC, and other similar schemes completely out of the water. (*sigh*)


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Tim Bradshaw

Jan 22, 2004, 6:41:37 AM
to
* Joe Marshall wrote:

> I don't think you can characterize this as the spammers having `won'.
> They `win' when they get even one response from the multiple millions
> of email they send. The people that have to clean up after them are
> of no concern to them one way or the other.

By `won' I meant `destroyed email as a useful communications medium',
which I agree isn't what they'd probably count as winning, sorry.

--tim

Tim Bradshaw

Jan 22, 2004, 6:51:29 AM
to
* Erik Naggum wrote:

> [excellent description of SMS elided]

> Think of it as portable e-mail available to the entire population.

One of the crucial things about SMS is *it costs money to send*. So
you get rather little SMS spam (you do get some, but not much). I'm
afraid I think that *charging the sender* is probably the best
solution for spam there is. If I had to give up email or SMS now, I'd
give up email without a second thought.

--tim

Tage Stabell-Kulø

Jan 22, 2004, 10:23:47 AM
to
rp...@rpw3.org (Rob Warnock) writes:

> Tage Stabell-Kulø <ta...@ifi.uit.no> wrote:
> +---------------
> | Cloudmark works on the assumption that the body of spam is identical
> | for all users. As long as it is, the approach will work if you have a
> | large enough user community.
> +---------------

> Unfortunately, that assumption stopped being true about 3-4 years ago.

Well, I used Cloudmark up until mid December. I hadn't paid so much
attention to spam since I received only about 10 per day. When I had
to let go of Cloudmark I was flooded by 100-200 spams every day; I
simply hadn't realized how effective Cloudmark was.

I believe that for the moment, sending 1 email to N addresses (where N
>> 1000) takes an order of magnitude or two less time than sending
N * (1 email to 1 address). So although in the long run it might be a
dead end, it worked amazingly well four weeks ago.
worked amazingly well four weeks ago.

Notice that if sendmail (and replacements) by default had a 30-second
delay, say, as part of the implementation of SMTP, Cloudmark and
similar approaches would continue to work. Time is not free, and
forcing the /sender/ to stay connected longer might be the simplest
and most effective way to incur a real cost on the sending side.


> -rob

[TaSK@/\\]


//// Tage Stabell-Kulø | email: ta...@ifi.uit.no////
/// Dept. Computer Science | Phone +47 7764 4032 ///
// University of Tromsø | Fax: +47 7764 4580 //
/ 9037 Tromsø, Norway | http://www.ifi.uit.no /

Hannah Schroeter

Jan 22, 2004, 11:31:57 AM
to
Hello!

Tage Stabell-Kulø <ta...@ifi.uit.no> wrote:
>[...]

>Notice that if sendmail (and replacements) by default had a 30 seconds
>delay, say, as part of the implementation of SMTP, Cloudmark and
>similar approaches would continue to work. Time is not free, and by
>enforcing the /sender/ to stay connected longer migth be the simplest
>and most effective way to incur a real cost on the sending side.

High traffic (legitimate) mail servers, such as relays of big
providers, will be highly delighted, similar to protocol extensions
like "hash cash" (force the sender to do a calculation for a few
hundred ms or even more, per mail).

Kind regards,

Hannah.

Hannah Schroeter

Jan 22, 2004, 11:40:35 AM
to
Hello!

Tim Bradshaw <t...@cley.com> wrote:
>[...]

>One of the crucial things about SMS is *it costs money to send*.

Especially, it costs *much* money in the usual pricing schemes here
in Germany, often 19 Euro-cents per message. For that, I can have
more than 6 minutes of voice connection from the mobile phone to a
local wire phone destination, or 1 minute to a mobile phone outside
business hours - and you can have much interaction in 1 minute
of telephoning! That makes SMS nearly useless for me, especially
as replying costs that money, too, while replying to what I say
in a phone call doesn't cost the other side additional money.

Enough ranting.

Kind regards,

Hannah.

Jens Axel Søgaard

Jan 22, 2004, 1:21:17 PM
to
Hannah Schroeter wrote:
> Especially, it costs *much* money in the usual pricing schemes here
> in Germany, often 19 Euro-cent per message, ...

Ouch.

In Denmark the high pricing opened the telecom market for discount
companies. The one thing that made the youngest customers switch was
low pricing on SMS messages. For some time the "old" companies just
went with the "quality and price match", but in the end they too
were forced to create discount brands.

A typical discount brand has no monthly fee, cheap SMS, and
the only way to pay is advance payment using "VISA" over the
internet.

A typical example (the discount brand of TDC):

<http://www.mixit.dk/>

One SMS costs 20 øre = 0.20 Danish kroner.
(And one euro costs 7.44 kroner.)

When you sign up, you get 400 SMS for free.

--
Jens Axel Søgaard

Tim Bradshaw

Jan 22, 2004, 12:52:09 PM
to
* Hannah Schroeter wrote:

> Especially, it costs *much* money in the usual pricing schemes here
> in Germany, often 19 Euro-cent per message, for that, I can have up
> to more than 6 minutes of voice connection from the mobile phone to a
> local wire phone destination, or 1 minute to a mobile phone except
> on business hours - and you can have much interaction in 1 minute
> of telephoning! That makes SMS near to useless for me, especially
> as replying costs that money, too, while replying to what I say
> in a phone call doesn't cost the other side additional money.

I think it's fairly clear that SMS is overpriced at present, yes. I
think when I first had a (prepay) mobile it was something like 4p a
message, and it's now 12p. This is especially amazing since it's
unbelievably cheap to send - SMS essentially is sent in the metadata
that would normally be involved in a voice call. And of course it
doesn't have any of the real-time constraints that voice does, so it
can be sent in the times when the network is otherwise idle.

I think the reason for this is fairly clear - mobile companies have
enormous financial problems (no one wants 3G), and SMS is terribly
popular, so they can crank up the costs to keep themselves afloat.
What will happen at some point, in the UK anyway, is that one of the
operators will go under, and someone will buy their infrastructure and
can then move to a reasonable pricing structure, since their enormous
debts will have been written off. This will, of course, bankrupt the
other operators fairly rapidly...

Curiously, although my partner and I have phone tariffs which include
nearly 2 hrs of off-peak calls a day to each other in total, we still
use SMS a whole lot (though not like lots of people do). One issue is
that it's non-intrusive - you can read and answer an SMS later, which
you can't do for a phone call (well, you can do the whole voicemail
nightmare...)

Now, what does this have to do with Lisp...

--tim

Joe Marshall

Jan 22, 2004, 1:46:59 PM
to

> Hannah Schroeter wrote:
>> Especially, it costs *much* money in the usual pricing schemes here
>> in Germany, often 19 Euro-cent per message, ...

Jens Axel Søgaard <use...@jasoegaard.dk> writes:

> A typical example (the discount brand of TDC):
>
> <http://www.mixit.dk/>
>
> One SMS costs 20 øre = 0.20 danish kroner.
> (And one euro costs 7,44 kroner).
>
> When you sign up, you get 400 SMS for free.

How does the spam rate compare? You don't want the messages to get
*too* inexpensive.

Torsten Poulin

Jan 22, 2004, 2:40:51 PM
to
Joe Marshall wrote:

> How does the spam rate compare? You don't want the messages to
> get *too* inexpensive.

In six years, I have received three or four unsolicited SMS
messages. Domestic spam is practically non-existent in Denmark.

According to section 6a of the Danish marketing law, it is
illegal for a business or tradesman to contact you using e-mail,
automatic dialing/calling systems, or fax, with the intent of
selling commodities or services (etc.), unless you have requested
it yourself.

--
Torsten

Jens Axel Søgaard

Jan 22, 2004, 3:20:17 PM
to
Torsten Poulin wrote:
> Joe Marshall wrote:

>>How does the spam rate compare? You don't want the messages to
>>get *too* inexpensive.

> In six years, I have received three or four unsolicited SMS
> messages. Domestic spam is practically non-existent in Denmark.

Same experience here.

> According to section 6a of the Danish marketing law, it is
> illegal for a business or tradesman to contact you using e-mail,
> automatic dialing/calling systems, or fax, with the intend of
> selling commodities or services (etc.), unless you have requested
> it yourself.

And it is enforced too. This week the company Aircom got
a 400,000 kroner fine for sending 7500-15000 fax messages.

Earlier this year a company sent 156 spam emails. The
fine was 15,000 kroner.


<http://www.jp.dk/itogc/artikel:aid=2226124/>

--
Jens Axel Søgaard

Thomas F. Burdick

Jan 22, 2004, 5:53:00 PM
to
Erik Naggum <er...@naggum.no> writes:

> * Joe Marshall
> | I'm not familiar with SMS. What is it?
>
> Short Message Service.

Sounds like what we call text messages here. They haven't completely
displaced two-way pagers, but almost. The usual cost is one minute's
airtime per message. And I hardly get any text-message spam.

--
/|_ .-----------------------.
,' .\ / | No to Imperialist war |
,--' _,' | Wage class war! |
/ / `-----------------------'
( -. |
| ) |
(`-. '--.)
`. )----'

Raffael Cavallaro

Jan 23, 2004, 12:30:10 AM
to
Tim Bradshaw <t...@cley.com> wrote in message news:<ey31xpu...@cley.com>...

> I've struggled with a Bayesian thing but it really doesn't
> cope: I think it is failing because there isn't enough good mail to
> train it on (I get hardly any `real' mail: probably 1% of my mail is
> real), so it essentially classifies everything as spam, which I then
> have to wade through.

It might be useful to know that:
<http://spamassassin.org/publiccorpus/>
contains vast amounts of both ham and spam email for training your
filters. Before using it, I also suffered somewhat from the
insufficient ham problem. Since feeding the public corpus ham and spam
to my spam filter, I've had no problems at all with either false
positives, or false negatives.

Ingvar Mattsson

Jan 23, 2004, 4:37:46 AM
to
Tim Bradshaw <t...@cley.com> writes:

[ On SMS ]


> Now, what does this have to do with Lisp...

160 characters is a bit short for some Lisp programs, but one can do
an amazing amount of silly string processing in 320...

//ingvar
--
(defun p(i d)(cond((not i)(terpri))((car i)(let((l(cadr i))(d(nthcdr(car i)d
)))(princ(elt(string(car d))l))(p(cddr i)d)))(t(princ #\space)(p(cdr i)d))))
(p'(76 2 1 3 1 4 1 6()0 5()16 10 0 7 0 8 0 9()2 6 0 0 12 4 23 4 1 4 8 8)(sort
(loop for x being the external-symbols in :cl collect (string x)) #'string<))

Adam Warner

Jan 23, 2004, 5:50:23 AM
to
Hi Ingvar Mattsson,

> 160 characters is a bit short for some Lisp programs, but one can make
> an amazing amount of silly string processing in 320...
>
> //ingvar
> --
> (defun p(i d)(cond((not i)(terpri))((car i)(let((l(cadr i))(d(nthcdr(car i)d
> )))(princ(elt(string(car d))l))(p(cddr i)d)))(t(princ #\space)(p(cdr i)d))))
> (p'(76 2 1 3 1 4 1 6()0 5()16 10 0 7 0 8 0 9()2 6 0 0 12 4 23 4 1 4 8 8)(sort
> (loop for x being the external-symbols in :cl collect (string x)) #'string<))

That is wonderfully obfuscated. It's actually only 309 characters.
Here's a 289 character version:

(defun p(i d)(cond((not i)(terpri))((car i)(let((l(cadr i))(d(nthcdr(car

i)d)))(princ(elt(string(car d))l))(p(cddr i)d)))(t(princ" ")(p(cdr
i)d))))(p'(76 2 1 3 1 4 1 6()0 5()16 10 0 7 0 8 0 9()2 6 0 0 12 4 23 4 1 4 8
8)(sort(let(x)(do-external-symbols(s"CL")(push(string s)x))x)#'string<))

I reduced characters by:
- Replacing #\space with" ".
- Using newlines in place of what would otherwise be mandatory spaces.
- Replacing (loop for x being the external-symbols in :cl collect (string x))
with (let(x)(do-external-symbols(s"CL")(push(string s)x))x)
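As an aside, the DO-EXTERNAL-SYMBOLS replacement above also makes a handy one-off check of how many symbols the LOOP form was iterating over; a conforming implementation exports exactly 978 symbols from the COMMON-LISP package:

```lisp
;; Count the external symbols of the COMMON-LISP package.
;; ANSI CL specifies exactly 978 of them.
(let ((n 0))
  (do-external-symbols (s "CL")
    (declare (ignore s))
    (incf n))
  n)
;; => 978
```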

I couldn't make any use of #n. (#1=princ(elt(string(car d))l))(p(cddr i)d)))(t(#1#" ")
turns out to consume 1 extra character but it might be beneficial just for its
obfuscation value.

My version breaks CLISP 2.32's interactive environment. It doesn't handle
evaluating the function P while continuing to read partial input. When it
gets to line 4 it thinks there is an unbalanced closing bracket.

Can anyone better 289? BTW I've already tried replacing COND with IF but it
doesn't help because a PROGN is required.

Have fun,
Adam

Ingvar Mattsson

Jan 23, 2004, 6:02:29 AM
to
Adam Warner <use...@consulting.net.nz> writes:

> Hi Ingvar Mattsson,
>
> > 160 characters is a bit short for some Lisp programs, but one can make
> > an amazing amount of silly string processing in 320...
> >
> > //ingvar
> > --

[ SNIP code ]


> That is wonderfully obfuscated. It's actually only 309 characters.
> Here's a 289 character version:

[ SNIP code ]


> I reduced characters by:
> - Replacing #\space with" ".
> - Using newlines in place of what would otherwise be mandatory spaces.
> - Replacing (loop for x being the external-symbols in :cl collect (string x))
> with (let(x)(do-external-symbols(s"CL")(push(string s)x))x)
>

One of the reasons for the #\space is that it makes lines 1&2 equally
long and lines 3&4 equally long. It's not only code, it is also
presentation.

//Ingvar
--
(defmacro fakelambda (args &body body) `(labels ((me ,args ,@body)) #'me))
(funcall (fakelambda (a b) (if (zerop (length a)) b (format nil "~a~a"
(aref a 0) (me b (subseq a 1))))) "Js nte iphce" "utaohrls akr")

Adam Warner

Jan 23, 2004, 6:19:24 AM
to
Hi Ingvar Mattsson,

> One of the reasons for the #\space is that it makes lines 1&2 equally
> long and lines 3&4 equally long. It's not only code, it is also
> presentation.

With a little readjustment, 291 characters (including three newlines):

(defun p(i d)(cond((not i)(terpri))((car i)(let((l(cadr i))(d(nthcdr(car

i)d)))(princ(elt(string(car d))l))(p(cddr i)d)))(t(princ" ")(p(cdr i)d))
))(p '(76 2 1 3 1 4 1 6()0 5()16 10 0 7 0 8 0 9()2 6 0 0 12 4 23 4 1 4 8


8)(sort(let(x)(do-external-symbols(s"CL")(push(string s)x))x)#'string<))

Regards,
Adam

Alain Picard

Jan 23, 2004, 7:10:54 AM
to

Tim Bradshaw <t...@cley.com> writes:

> So spam has essentially destroyed any kind of rich content in
> email.

Well, we can be thankful for small favors. :-)

Rob Warnock

Jan 23, 2004, 8:17:06 AM
to
Tage Stabell-Kulø <ta...@ifi.uit.no> wrote:
+---------------
| rp...@rpw3.org (Rob Warnock) writes:
| > Unfortunately, that assumption stopped being true about 3-4 years ago.
|
| Well, I used Cloudmark up until mid December. I hadn't paid so much
| attention to spam since I received only about 10 per day. When I had
| to let go of Cloudmark I was flooded by 100-200 spams every day; I
| simply hadn't realized how effective Cloudmark was.
|
| I believe that for the moment, sending 1 email to N addresses (where N
| >> 1000) takes an order or two less time than sending N * (1 email to
| 1 address). So although in the long run it might be a dead end, it
| worked amazingly well four weeks ago.
+---------------

Then you've been lucky. Your assumption of "sending 1 email to N addresses"
is no longer true for the worst spam, which is now using "innocent"[1]
systems hijacked by viruses/worms to do their SMTP sending for them,
thus becoming a form of DDoS[2] attack by thousands of machines which
can easily spend the tiny amount of time needed to "customize" (randomize)
each message.

An examination of my current inbox confirms this. Where it was previously
common to find spam with dozens or even hundreds of addressees per message
[with my address buried in the middle of other addresses], now almost all
of them have only one recipient: me.


-Rob

[1] If every Microsoft user whose machine was used by a virus to attack
or send spam to another machine was fined, say, $1000.00 per outgoing
connection, betcha that stuff would stop quickly enough, eh? But in
the current world, "it ain't gonna happen"... (*sigh*)

[2] Distributed Denial of Service.

Tim Bradshaw

Jan 23, 2004, 9:06:12 AM
to
* Rob Warnock wrote:

> An examination of my current inbox confirms this. Where it was previously
> common to find spam with dozens or even hundreds of addressees per message
> [with my address buried in the middle of other addresses], now almost all
> of them have only one recipient: me.

I don't recall seeing large address lists very often - I think what I
used to see was stuff sent to some completely bogus recipient, and
presumably BCCd to lots of people, me included. However most of what
I now see is indeed sent directly to me (or actually large chunks of
it goes to webmaster/postmaster &c and gets to me that way), so I
think it is all coming from compromised PCs.

--tim

Harald Hanche-Olsen

Jan 23, 2004, 4:24:11 PM
to
+ t...@famine.OCF.Berkeley.EDU (Thomas F. Burdick):

| Erik Naggum <er...@naggum.no> writes:
|
| > Short Message Service.
|
| Sounds like what we call text messages here. They haven't completely
| displaced two-way pagers, but almost.

They have here. Nationwide paging service was terminated last
September, and SMS has taken over.

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- Debating gives most of us much more psychological satisfaction
than thinking does: but it deprives us of whatever chance there is
of getting closer to the truth. -- C.P. Snow

Ng Pheng Siong

Jan 24, 2004, 9:42:04 AM
to
According to Raffael Cavallaro <raf...@mediaone.net>:
> [Since teaching my filters properly,] I've had no problems at all with

> either false positives, or false negatives.

Serious questions: What is the benefit? That you no longer see spam in your
inbox? Do your filters put spam into its own folder or redirect to
/dev/null pronto? If the former, do you check the spam folder periodically?

I've been doing things the other way round: Filter all known stuff into
their own folders. What's left is either spam, bounces from forgeries, or
directly-addressed good mail. The junk stuff is easily identified visually
and as easily deleted. (Although it's getting trickier...)

I haven't been motivated enough to try out any spam filter, be it server-
or client-based. I'm curious if I'm missing out on something.

Thanks. Cheers.

--
Ng Pheng Siong <ng...@netmemetic.com>

http://firewall.rulemaker.net -+- Firewall Change Management & Version Control
http://sandbox.rulemaker.net/ngps -+- Open Source Python Crypto & SSL

Pekka P. Pirinen

Jan 26, 2004, 9:55:00 AM
to
Adam Warner <use...@consulting.net.nz> writes:
> With a little readjustment, 291 characters (including three newlines):

With a lot more readjustment, 240 characters (not including the
newlines):

(let(x(i'(1 3 1 4 1 6 0()0 5 9()7 10 0 7 0 8 0 9 0()2 6 0 0 12 4 23 4 1 4 8 8)))
(defun r(d l)(princ(if l(elt(car d)l)" "))(if i(r(nthcdr(pop i)d)(pop i))(terpri
)))(r(nthcdr 76(sort(do-external-symbols(s"CL"x)(push(string s)x))#'string<))2))

It's a neat idea for a sig, underlining the stability of the standard.

I also like bignum encoding, just for the joy of using MULTIPLE-VALUE-CALL:
--
Pekka P. Pirinen (defun b (d c)
(princ (code-char (+ 32 c)))
(if (> d 0) (multiple-value-call #'b (floor d 54)) (terpri)))
(b 265038806351786925053937031127 42)
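For the curious, the bignum sig above decodes as follows: the second argument is the first character's code offset by 32, and the first argument packs the rest of the string as base-54 "digits", least significant first. A matching encoder can therefore be written by inverting B step by step (the name UNB is invented here; note that a trailing space, whose digit is 0, would be silently dropped by B's termination test):

```lisp
;; Inverse of B above: pack STRING (character codes in 32..85 only)
;; into the (bignum, first-digit) pair that B expects.
(defun unb (string)
  (let ((d 0))
    ;; Fold the tail of the string into D, last character outermost,
    ;; so B's repeated (floor d 54) peels the characters off in order.
    (loop for i from (1- (length string)) downto 1
          do (setf d (+ (* d 54) (- (char-code (char string i)) 32))))
    (values d (- (char-code (char string 0)) 32))))
```

With this, `(multiple-value-call #'b (unb "HELLO"))` prints `HELLO` followed by a newline, reproducing the sig mechanism on an arbitrary message.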

Adam Warner

Jan 26, 2004, 4:43:13 PM
to
Hi Pekka P. Pirinen,

> With a lot more readjustment, 240 characters (not including the
> newlines):
>
> (let(x(i'(1 3 1 4 1 6 0()0 5 9()7 10 0 7 0 8 0 9 0()2 6 0 0 12 4 23 4 1 4 8 8)))
> (defun r(d l)(princ(if l(elt(car d)l)" "))(if i(r(nthcdr(pop i)d)(pop i))(terpri
> )))(r(nthcdr 76(sort(do-external-symbols(s"CL"x)(push(string s)x))#'string<))2))

Superb! 3 x 80 is a lovely outcome and superior to 4 x 72.

> It's a neat idea for a sig, underlining the stability of the standard.
>
> I also like bignum encoding, just for the joy of using MULTIPLE-VALUE-CALL:
> --
> Pekka P. Pirinen (defun b (d c)
> (princ (code-char (+ 32 c)))
> (if (> d 0) (multiple-value-call #'b (floor d 54)) (terpri)))
> (b 265038806351786925053937031127 42)

Impressive encoding. Thanks for the lesson in using MULTIPLE-VALUE-CALL.
Until now I would have written (multiple-value-call #'b (floor d 54)) as
(multiple-value-bind (quo rem) (floor d 54) (b quo rem)).

Regards,
Adam

Pascal Bourguignon

Jan 26, 2004, 6:49:04 PM
to
Pekka.P...@globalgraphics.com (Pekka P. Pirinen) writes:
> I also like bignum encoding, just for the joy of using MULTIPLE-VALUE-CALL:
> --
> Pekka P. Pirinen (defun b (d c)
> (princ (code-char (+ 32 c)))
> (if (> d 0) (multiple-value-call #'b (floor d 54)) (terpri)))
> (b 265038806351786925053937031127 42)

I prefer one liners:

(princ(substitute #\Space #\0(format()"~36R"5688852237040631986030796883)))
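This one-liner prints the bignum in radix 36 and then turns the digit 0 into spaces, so any message built from the characters A-Z and 1-9, with single spaces between words, can be packed into one integer. A hypothetical inverse (the name PACK36 is invented for illustration):

```lisp
;; Inverse of the ~36R trick: spaces become the digit 0, and the whole
;; string is read as a single base-36 integer. Works for messages made
;; of A-Z and 1-9 (no zeros, and not starting with a space).
(defun pack36 (message)
  (parse-integer (substitute #\0 #\Space message) :radix 36))
```

Round-tripping confirms the idea: `(princ (substitute #\Space #\0 (format () "~36R" (pack36 "LISP IS FUN"))))` prints `LISP IS FUN`.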


--
__Pascal_Bourguignon__ http://www.informatimago.com/
There is no worse tyranny than to force a man to pay for what he doesn't
want merely because you think it would be good for him.--Robert Heinlein
http://www.theadvocates.org/

Russell Wallace

Feb 1, 2004, 1:39:05 PM
to
On 27 Jan 2004 00:49:04 +0100, Pascal Bourguignon
<sp...@thalassa.informatimago.com> wrote:

>I prefer one liners:
>
>(princ(substitute #\Space #\0(format()"~36R"5688852237040631986030796883)))

I'm curious, what does this do? Do all the programs in this thread do
the same thing?

--
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
http://www.esatclear.ie/~rwallace

Gareth McCaughan

Feb 1, 2004, 8:40:31 PM
to
Russell Wallace wrote:

> On 27 Jan 2004 00:49:04 +0100, Pascal Bourguignon
> <sp...@thalassa.informatimago.com> wrote:
>
>> I prefer one liners:
>>
>> (princ(substitute #\Space #\0(format()"~36R"5688852237040631986030796883)))
>
> I'm curious, what does this do? Do all the programs in this thread do
> the same thing?

I'm curious too: are you really (1) reading comp.lang.lisp
but (2) not in possession of any Lisp implementation? If so,
you should fix #2 as soon as possible...

--
Gareth McCaughan
.sig under construc

Pascal Bourguignon

Feb 1, 2004, 9:18:07 PM
to
wallacet...@eircom.net (Russell Wallace) writes:

> On 27 Jan 2004 00:49:04 +0100, Pascal Bourguignon
> <sp...@thalassa.informatimago.com> wrote:
>
> >I prefer one liners:
> >
> >(princ(substitute #\Space #\0(format()"~36R"5688852237040631986030796883)))
>
> I'm curious, what does this do? Do all the programs in this thread do
> the same thing?

1- Launch your prefered Common-Lisp implementation.
2- Copy the (princ..))) line above.
3- Paste into your Common-Lisp.
4- Type return.
5- See what it does.
6- Repeat with the others.

David Combs

Feb 12, 2004, 3:20:37 AM
to
In article <874quqa...@gruk.tech.ensign.ftech.net>,
Ingvar Mattsson <ing...@cathouse.bofh.se> wrote:
>Tim Bradshaw <t...@cley.com> writes:
>
>> * Ingvar Mattsson wrote:
>>
>> > FWIW, I used to "not have a spam problem" (that is, the ratio of spam
>> > to legitimate email was low enough that the spam was not a
>> > distraction). In the last 3-4 months, it's gone to spam being the
>> > *major* part of my mailbox. Addresses ending up there have been in
>> > existence (and used on Usenet) from 1993, 1996 and somewhere around
>> > 1999 (and one only since last year).
>>
>> Me too. I've struggled with a Bayesian thing but it really doesn't

>> cope: I think it is failing because there isn't enough good mail to
>> train it on (I get hardly any `real' mail: probably 1% of my mail is
>> real), so it essentially classifies everything as spam, which I then
>> have to wade through. I may try feeding it my whole mailbox as good
>> to give it some more data to learn from, but that's only a couple of
>> days worth of spam now, so I'm not sure if it will work.
>
>I initially fed my Bayesian filter with "mail saved away". I *do* tend
>to save most mail I get (two mailing lists get mostly-read, then
>mostly-deleted), so I did have a couple of years' worth to feed it
>with.
>

Myself, not having the energy or perhaps even the brainpower
to master procmail and spamassassin etc, etc, plus things
everyone uses but I've never heard of --

What I do is this (via mutt):

sort first by subject, go to each end, back off over a
continuous run of clearly-spam ones, and then delete the
whole bunch of them.

Then I sort them by "from", and do the same thing -- delete
the weird ones.

Then, I sort by date, and the vast majority of the newest
messages look strange, so can be deleted.

Also, I note that I find a lot by scanning down the date-sorted
subjects, looking for strange chars out to the right. Spam.


David


Christian Lynbech

Feb 12, 2004, 6:43:35 AM
to
>>>>> "David" == David Combs <dkc...@panix.com> writes:

David> Myself, not having the energy or perhaps even the brainpower
David> to master procmail and spamassassin etc, etc, plus things
David> everyone uses but I've never heard of --

I have been using ifile via Jeremy Brown's ifile-gnus adaptations, but I
was getting increasingly frustrated by the false negatives. I was not
good enough to spot the not-spam mails in the spam folder, so I am now
trying a new approach.

I am still using ifile, but rather than running ifile on the raw mail,
I have gnus render the mail first and then run ifile on the
result. This is of course significantly more costly but as I tend to
get a lot of good mail with voluminous Office document attachments I
hope the filter quality will go up.


------------------------+-----------------------------------------------------
Christian Lynbech | christian #\@ defun #\. dk
------------------------+-----------------------------------------------------
Hit the philistines three times over the head with the Elisp reference manual.
- pet...@hal.com (Michael A. Petonic)

Joe Marshall

Feb 12, 2004, 10:28:25 AM
to
dkc...@panix.com (David Combs) writes:

> Myself, not having the energy or perhaps even the brainpower
> to master procmail and spamassassin etc, etc, plus things
> everyone uses but I've never heard of --

You are lacking gumption. Here is the .procmail file I'm using.
It is currently in `pass spam' mode rather than `discard spam' because
I'm collecting stats.

MAILDIR=${HOME}/Mail
LOGFILE=${HOME}/.procmail-log
SHELL=/bin/sh

:0
* ^SUBJECT: \*\*\*\*\*SPAM\*\*\*\*\**
/dev/null

:0
* ^X-Spam-Status: Yes,.*MICROSOFT_EXECUTABLE
/dev/null

:0
* ^Subject: =\?big5\?Q\?
/dev/null

:0
* ^Subject: =\?Big5\?B\?
/dev/null

:0
* ^X-Spam-Flag: Yes
Incoming

:0:
Incoming

Russell Wallace

Feb 20, 2004, 12:19:13 PM
to
On 02 Feb 2004 01:40:31 +0000, Gareth McCaughan
<gareth.m...@pobox.com> wrote:

>I'm curious too: are you really (1) reading comp.lang.lisp
>but (2) not in possession of any Lisp implementation? If so,
>you should fix #2 as soon as possible...

I'm too lazy to re-download and re-install Corman Lisp for the sake of
running a 3-line program :)

David Combs

Feb 28, 2004, 5:41:29 PM
to

Thanks!

Looks like I'll have to screw up my courage and give it a try.

David


Joe Marshall

Feb 28, 2004, 5:57:16 PM
to
dkc...@panix.com (David Combs) writes:

> Thanks!
>
> Looks like I'll have to screw up my courage, and give it a try.

Just one thing. Someone pointed out to me that one rule has a bug:

>> :0
>> * ^X-Spam-Flag: Yes
>> Incoming

should be

>> :0:
>> * ^X-Spam-Flag: Yes
>> Incoming


The extra colon locks the `Incoming' file.

--
~jrm
