The email package and KLEZ mails

Gerson Kurz

May 28, 2002, 4:21:56 AM

I'm using the email module (new in Python 2.2) to analyze messages for
spam and HTML content. However, I get exceptions when analyzing KLEZ
generated mails, which is disappointing since I'm trying to filter
them in the first place.

Anyway, the callstack boils down to:

  File "C:\Python22\lib\email\Parser.py", line 81, in _parseheaders
    raise Errors.HeaderParseError('Not a header, not a continuation: "%s"' % line)
HeaderParseError: Not a header, not a continuation: "--X675I6X36yZg9J7wg290j"

Note that I manually added the "%s" bit to the source of Parser.py to
see exactly what causes the error; the rest of the file is unchanged.
Now, when I look at the email headers, they look something like this:

<QUOTE>
Received: from mcrt-kl-my-1.inter-touch.net (203.121.124.130) by
    Shicks! Version 0.9 on darkstar at Tue May 28 08:56:17 2002
Received: from Nrdqtbuem ([203.121.124.154])
    by mcrt-kl-my-1.inter-touch.net (8.9.3/8.9.3) with SMTP id PAA29881
    for <...(some address here)...>; Tue, 28 May 2002 15:56:49 +0800
Date: Tue, 28 May 2002 15:56:49 +0800
Message-Id: <...(some msgid here)...>
From: jraj_5 <...(some address here)...>
To: ...(some address here)...
Subject: A good tool
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary=X675I6X36yZg9J7wg290j

--X675I6X36yZg9J7wg290j
Content-Type: text/html;
... (Rest of mail stripped)...
</QUOTE>

and the email package seems to be missing the correct boundary syntax.

Gerhard Häring

May 28, 2002, 5:08:26 AM

* Oleg Broytmann <p...@phd.pp.ru> [2002-05-28 12:31 +0400]:

> On Tue, May 28, 2002 at 08:21:56AM +0000, Gerson Kurz wrote:
> > I'm using the email module (new in Python 2.2) to analyze messages for
> > spam and HTML content. However, I get exceptions when analyzing KLEZ
> > generated mails, which is disappointing since I'm trying to filter
> > them in the first place.
>

> Klez is a carefully created virus. It sends mail that specifically
> targets Outofluck holes. Those mail messages are constructed in violation
> of RFCs, so you really cannot parse them with RFC-compliant tools :)

Which is good, because it's a certain sign that you can just throw the
message away because all the interesting email you'll get will be RFC
compliant >:-)

You could send an auto-reply in case somebody's MUA or mail system was
fscked up.

Gerhard
--
This sig powered by Python!
Außentemperatur in München: 8.5 °C Wind: 4.1 m/s

Oleg Broytmann

May 28, 2002, 5:15:05 AM

On Tue, May 28, 2002 at 11:08:26AM +0200, Gerhard Häring wrote:
> * Oleg Broytmann <p...@phd.pp.ru> [2002-05-28 12:31 +0400]:
> > On Tue, May 28, 2002 at 08:21:56AM +0000, Gerson Kurz wrote:

> > > I'm using the email module (new in Python 2.2) to analyze messages for
> > > spam and HTML content. However, I get exceptions when analyzing KLEZ
> > > generated mails, which is disappointing since I'm trying to filter
> > > them in the first place.
> >

> > Klez is a carefully created virus. It sends mail that specifically
> > targets Outofluck holes. Those mail messages are constructed in violation
> > of RFCs, so you really cannot parse them with RFC-compliant tools :)
>
> Which is good, because it's a certain sign that you can just throw the
> message away because all the interesting email you'll get will be RFC
> compliant >:-)

At least it should be, though that is not always true. RFC 2047 is violated
very often :(

> You could send an auto-reply in case somebody's MUA or mail system was
> fscked up.

You cannot. Klez is a *really* clever virus. It inserts bogus
Reply-To/Return-path headers and envelope headers :(

Oleg.
--
Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.

Oleg Broytmann

May 28, 2002, 4:31:26 AM

On Tue, May 28, 2002 at 08:21:56AM +0000, Gerson Kurz wrote:

> I'm using the email module (new in Python 2.2) to analyze messages for
> spam and HTML content. However, I get exceptions when analyzing KLEZ
> generated mails, which is disappointing since I'm trying to filter
> them in the first place.

Klez is a carefully created virus. It sends mail that specifically
targets Outofluck holes. Those mail messages are constructed in violation
of RFCs, so you really cannot parse them with RFC-compliant tools :)

Oleg.

François Pinard

May 28, 2002, 7:13:05 AM

[Gerhard Häring]

> Which is good, because it's a certain sign that you can just throw the
> message away because all the interesting email you'll get will be RFC
> compliant >:-)

In my experience, incorrect MIME structure is one of the numerous
hints about mail being SPAM. I do not remember a single false positive.

Strangely enough to me, many SPAM makers still do not follow RFCs.
Consequently, various filters detect various departures from RFCs. Such
detections are only temporary measures: in the long term, SPAM makers are
going to learn and improve. In the meantime, this is mere luck for us :-).
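
The idea of treating broken MIME structure as one hint among several can be sketched as follows. This is only an illustration and assumes a present-day Python 3, whose email package postdates this thread: it parses leniently and records problems in a `defects` list on each message object instead of raising. The function names and the two sample messages are hypothetical.

```python
import email


def collect_defects(raw_text):
    """Parse leniently and return every defect the parser recorded."""
    msg = email.message_from_string(raw_text)
    defects = []
    for part in msg.walk():  # the message itself, then any sub-parts
        defects.extend(part.defects)
    return defects


def looks_malformed(raw_text):
    """Hypothetical filter hint: True if the parser recorded any defect."""
    return bool(collect_defects(raw_text))


# Hypothetical sample messages, not taken from real mail.
WELL_FORMED = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XYZ--\n"
)
BROKEN = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "the declared boundary never appears in this body\n"
)
```

A filter built this way would add the result to a score rather than reject on it outright, since malformed mail is a hint and not proof.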

Anthony Baxter

May 28, 2002, 6:41:42 AM

Klez (and other misbegotten mailers) screw up MIME. In Klez's case
(and at least one other mail agent) it duplicates a boundary tag
inside the message. I'm working on a patch for the email package that
enables a "non-strict" parser mode. It handles this, and every other
piece of bastardry that's come through my mailbox in the last few
months. I've one more case to go, then a test suite. I expect this
to be finished tomorrow, as I need it for work.
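
To make the failure mode concrete, here is a minimal sketch of how a repeated boundary line can end up at the top of a sub-part and trip a strict header parser. It is a simplified stand-in, not the email package's code, and the exact layout of a real Klez message is an assumption: the sample bodies and the boundary "B" are hypothetical.

```python
class HeaderParseError(Exception):
    """Raised when a line is neither a header nor a continuation."""


def parse_headers_strict(text):
    """Strictly parse header lines up to the first blank line."""
    headers = []
    for line in text.split("\n"):
        if not line.strip():
            break  # a blank line ends the header block
        if line[0] in " \t":
            # Continuation line: append it to the previous header's value.
            if not headers:
                raise HeaderParseError("Continuation line seen before first header")
            name, value = headers[-1]
            headers[-1] = (name, value + "\n" + line)
            continue
        name, colon, value = line.partition(":")
        if not colon:
            raise HeaderParseError('Not a header, not a continuation: "%s"' % line)
        headers.append((name, value.strip()))
    return headers


def split_parts(payload, boundary):
    """Split a multipart body on its boundary lines."""
    separator = "--" + boundary
    start = payload.find(separator)
    if start < 0:
        return [payload]
    start += len(separator) + 1  # skip the first boundary line and its newline
    terminator = payload.find("\n" + separator + "--", start)
    if terminator < 0:
        terminator = len(payload)
    return payload[start:terminator].split("\n" + separator + "\n")


def part_headers(payload, boundary):
    """Return the parsed headers of the first part."""
    return parse_headers_strict(split_parts(payload, boundary)[0])


# Hypothetical bodies; "B" stands in for a real boundary string.
GOOD_BODY = "--B\nContent-Type: text/html\n\n<html></html>\n--B--\n"
KLEZ_LIKE_BODY = "--B\n" + GOOD_BODY  # the opening boundary line appears twice
```

With `GOOD_BODY` the first part starts with a real header. With `KLEZ_LIKE_BODY` the first part starts with the second `--B` line, which has no colon and is not a continuation, so `part_headers` raises `HeaderParseError`.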

Oleg's comment about 'RFC compliant tools' isn't correct - a basic
principle of internet protocols is that you should be liberal in
what you accept. The current strict parser in the email package isn't
suitable for this.

Anthony

--
Anthony Baxter <ant...@interlink.com.au>
It's never too late to have a happy childhood.

Anthony Baxter

May 28, 2002, 6:43:26 AM

>>> Gerhard Häring wrote

> Which is good, because it's a certain sign that you can just throw the
> message away because all the interesting email you'll get will be RFC
> compliant >:-)

If you throw away all email that's not strictly compliant with the
standards, you'll end up losing a lot of mail.

> You could send an auto-reply in case somebody's MUA or mail system was
> fscked up.

Please don't. Auto-replies to viruses, bad emails, or whatever, make the
situation worse, and they very rarely help.

Anthony Baxter

May 28, 2002, 7:34:22 AM

> In my experience, incorrect MIME structure is one of the numerous
> hints about mail being SPAM. I do not remember a single false positive.

I wish. I have to deal with end-user email, and trust me, it's not all
spam.

Sheila King

May 28, 2002, 1:34:11 PM

On Tue, 28 May 2002 21:34:22 +1000, Anthony Baxter
<ant...@interlink.com.au> wrote in comp.lang.python in article
<mailman.1022585747...@python.org>:

>
> > In my experience, incorrect MIME structure is one of the numerous
> > hints about mail being SPAM. I do not remember a single false positive.
>
> I wish. I have to deal with end-user email, and trust me, it's not all
> spam.

I concur with Anthony. I have written an email filter package using the
email module and if you use the strict Parser class included in that
module, it does throw away too much good email (because any good mail
thrown away is too much). Moreover, as I've mentioned in other posts and
email correspondence, if you're writing software for end users, you really
can't just tell them: "Oh, all those mails that caused errors...they were
just non-RFC compliant. Probably SPAM or virus." First off, it's not 100%
correct. Secondly, why is it that the three other mail readers I use
(Agent, Pegasus, and PocoMail) are all able to parse these messages? I also
agree with the idea that applications must be strict in what they write and
liberal in what they accept.

In extensive correspondence with Barry Warsaw on this matter a few months
back, we came to the understanding that the Parser he provides in the email
module is intended to be a strict, RFC compliant Parser. The design of the
email module allows Python programmers to plug in their own Parser class
and use it with the rest of the email module to get the flexibility and
functionality that they need. Barry is open to including other types of
Parsers, but his point of view seems to be that if the strict Parser
provided in the email module cannot parse the email, then the Python
programmer should decide how to handle this and write appropriate code.

I have written a "smart parser" class that I am using in my email filter. I
use this class instead of the Parser class provided with the email module.
I provide the code below for all interested parties. It really does a
pretty good job of handling most mails that the strict Parser cannot
handle. If the "smart parser" class below cannot handle your email, then
there is an undocumented HeaderParser class in the email module (see the
source code for the Parser class) which is your next best chance.
Otherwise, you will have to write your own routines for parsing the
message.
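
The order of attempts described above (a full parse first, then headers only, then your own routine) can be sketched as a small fallback chain. This is an illustration in present-day Python 3, not part of the email module: `parse_with_fallbacks`, `strict_parse` and `headers_only_parse` are hypothetical stand-ins for whichever real parsers you plug in.

```python
def strict_parse(text):
    """Stand-in strict parser: rejects any header line without a colon."""
    head, _, body = text.partition("\n\n")
    for line in head.split("\n"):
        if ":" not in line and not line.startswith((" ", "\t")):
            raise ValueError("Not a header, not a continuation: %r" % line)
    return {"headers": head.split("\n"), "body": body}


def headers_only_parse(text):
    """Stand-in last resort: keep the lines that look like headers, drop the rest."""
    head, _, _ = text.partition("\n\n")
    return {"headers": [line for line in head.split("\n") if ":" in line],
            "body": None}


def parse_with_fallbacks(text, parsers):
    """Try each (name, parser) pair in order; return the first success."""
    for name, parse in parsers:
        try:
            return name, parse(text)
        except Exception:
            # Broad on purpose: any failure just means "try the next parser".
            continue
    return None, None


PARSERS = [("strict", strict_parse), ("headers-only", headers_only_parse)]
```

Returning the name of the parser that succeeded lets the caller log how often each fallback is needed, or treat a headers-only result differently from a full parse.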

The "smart parser" class below is adapted from the Parser code provided in
the email module. I have been using this in a production environment for a
couple of months now, and have quite a number of other beta testers also
using it, and we get almost no mails that cannot be parsed.

Caution: Because this module makes "assumptions" about the structure of the
message, if the received email is not RFC compliant and you use one of the
Generators to print it (a Generator is what gets called when you print a
message), the output may not be identical to the raw message that was
received. If you might need the original raw message, save it elsewhere in
your code.
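
One way to follow this advice is to carry the raw text next to the parsed object, so the original is still available whatever the parser or a Generator does with it. The sketch below uses present-day Python 3 and the standard `email.message_from_string`; the `StoredMail` class and `load_mail` function are hypothetical names for illustration.

```python
import email
from dataclasses import dataclass
from email.message import Message


@dataclass
class StoredMail:
    raw: str         # the message exactly as received
    parsed: Message  # the parser's (possibly normalised) view of it


def load_mail(raw_text):
    """Parse a message but keep the untouched original alongside it."""
    return StoredMail(raw=raw_text, parsed=email.message_from_string(raw_text))
```

Filtering logic can then inspect `parsed`, while delivery or quarantine code writes out `raw`.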

Code follows the signature. Enjoy,

--
Sheila King
http://www.thinkspot.net/sheila/
http://www.k12groups.org/
http://www.FutureQuest.net

##### CODE FOR SMART PARSER CLASS #####

# Written against the email package shipped with Python 2.2.
from cStringIO import StringIO

from email import Errors
from email.Parser import Parser

# Newline used to join the continuation lines of a header
NL = '\n'


class smart_Parser(Parser):

    def parse(self, fp):
        root = self._class()
        self._parseheaders(root, fp)
        self._parsebody(root, fp)
        return root

    def parsestr(self, text):
        return self.parse(StringIO(text))

    def _parseheaders(self, container, fp):
        # Parse the headers, returning a list of header/value pairs.
        # None as the header means the Unix-From header.
        lastheader = ''
        lastvalue = []
        lineno = 0
        while 1:
            line = fp.readline()[:-1]
            # Stop at end of input or at an empty / whitespace-only line
            if not line or not line.strip():
                break
            lineno += 1
            # Check for initial Unix From_ line
            if line.startswith('From '):
                if lineno == 1:
                    container.set_unixfrom(line)
                    continue
                else:
                    raise Errors.HeaderParseError(
                        'Unix-from in headers after first rfc822 header')
            #
            # Header continuation line (starts with a space or a tab)
            if line[0] in ' \t':
                if not lastheader:
                    raise Errors.HeaderParseError(
                        'Continuation line seen before first header')
                lastvalue.append(line)
                continue
            # Normal, non-continuation header.  BAW: this should check to
            # make sure it's a legal header, e.g. doesn't contain spaces.
            # Also, we should expose the header matching algorithm in the
            # API, and allow for a non-strict parsing mode (that ignores the
            # line instead of raising the exception).
            i = line.find(':')
            if i < 0:
                raise Errors.HeaderParseError(
                    'Not a header, not a continuation')
            if lastheader:
                container[lastheader] = NL.join(lastvalue)
            lastheader = line[:i]
            lastvalue = [line[i+1:].lstrip()]
        # Make sure we retain the last header
        if lastheader:
            container[lastheader] = NL.join(lastvalue)

    def _parsebody(self, container, fp):
        boundary = container.get_boundary()
        isdigest = (container.get_type() == 'multipart/digest')
        if boundary:
            preamble = epilogue = None
            separator = '--' + boundary
            payload = fp.read()
            start = payload.find(separator)
            if start < 0:
                # No starting boundary: keep the whole body as one payload
                container.add_payload(payload)
                return
            if start > 0:
                # Text before the first boundary is the preamble
                preamble = payload[0:start]
            start += len(separator) + 1 + isdigest
            terminator = payload.find('\n' + separator + '--', start)
            if terminator < 0:
                # No terminating boundary: use the end of the payload
                terminator = len(payload)
            if terminator + len(separator) + 3 < len(payload):
                # Text after the terminating boundary is the epilogue
                epilogue = payload[terminator + len(separator) + 3:]
            if isdigest:
                separator += '\n\n'
            else:
                separator += '\n'
            parts = payload[start:terminator].split('\n' + separator)
            for part in parts:
                # Skip parts that are empty or whitespace-only
                if not part.strip():
                    continue
                msgobj = self.parsestr(part)
                container.preamble = preamble
                container.epilogue = epilogue
                # Ensure that the container's payload is a list
                if not isinstance(container.get_payload(), type([])):
                    container.set_payload([msgobj])
                else:
                    container.add_payload(msgobj)
        elif container.get_type() == 'message/delivery-status':
            # This special kind of type contains blocks of headers separated
            # by a blank line.  We'll represent each header block as a
            # separate Message object
            blocks = []
            while 1:
                blockmsg = self._class()
                self._parseheaders(blockmsg, fp)
                if not len(blockmsg):
                    # No more header blocks left
                    break
                blocks.append(blockmsg)
            container.set_payload(blocks)
        elif container.get_main_type() == 'message':
            # Create a container for the payload, but watch out for there
            # not being any headers left
            try:
                msg = self.parse(fp)
            except Errors.HeaderParseError:
                msg = self._class()
                self._parsebody(msg, fp)
            container.add_payload(msg)
        else:
            container.add_payload(fp.read())

François Pinard

May 29, 2002, 10:17:23 PM

[Sheila King]
> [Anthony Baxter]
> > [François Pinard]

> > > In my experience, incorrect MIME structure is one of the numerous
> > > hints about mail being SPAM. I do not remember a single false positive.

> > I wish. I have to deal with end-user email, and trust me, it's not all
> > spam.

> I concur with Anthony. I have written an email filter package using the
> email module and if you use the strict Parser class included in that
> module, it does throw away too much good email (because any good mail
> thrown away is too much).

Maybe the `email' package is stricter than the various MIME processing
tools that were in Python 1.5.2 and still exist in more recent versions,
but I would be tempted to think they are of comparable strictness. I do
not really know.

The proverb says that "alike people get together"; it might explain why
I do not see more problems: most of my correspondents have mailer agents
which do a fair job at MIME generation. And when MIME mistakes happen,
it is usually sufficient to raise the subject with my correspondents,
who are usually happy to get the problem solved at their end.

Often (but not necessarily), badly structured messages come from people
who do not care much. Otherwise, they would have set themselves up better.
As I much prefer people who care, from my viewpoint, there is a significant
correlation between a message being MIME-erroneous and a message not being
worth much interest.

> Moreover, as I've mentioned in other posts and email correspondence,
> if you're writing software for end users, you really can't just
> tell them: "Oh, all those mails that caused errors...they were just
> non-RFC compliant. Probably SPAM or virus."

If you are writing filters for everybody, you are probably right. When I
write filters for my friends or for myself, in my experience, careless
MIME may be filtered out as SPAM, and we do not lose much in practice :-).

> Secondly, why is it that the three other mail readers I use (Agent,
> Pegasus, and PocoMail) are all able to parse these messages? I also
> agree with the idea that applications must be strict in what they write
> and liberal in what they accept.

This is a good principle, but only when kept within reasonable bounds.
Users should be on the side of being strict, and applications should be on
the side of being liberal. Users might suffer uselessly by being overly
ascetic, applications might miss their goal through unlimited friendliness.

For example, I expect compilers to raise diagnostics and help me at being
strict, because being overly liberal for a compiler is just not helpful.
Another example, a sad one, is the messy state of HTML all around us,
it comes from browsers having been by far too liberal, and for too long.

If mailer agents are very lenient to MIME mis-formatting, they actively
prevent progress. They do not really help it, as they trigger confusion.
Moreover, by implementing MIME poorly, they throw discredit on a good idea.
MIME standards are not that hard to read, you know. It is a mystery to
me why some mail agents mangle the MIME they generate, or fail to assemble
it conveniently, in the spirit of the standards, at presentation time.

> I have written a "smart parser" class that I am using in my email
> filter. I use this class instead of the Parser class provided with the
> email module. I provide the code below for all interested parties.

> [...] Code follows the signature. Enjoy,

I'm saving it for possible later use! Thanks for providing this...

--
François Pinard http://www.iro.umontreal.ca/~pinard

Sheila King

May 30, 2002, 2:20:15 AM

On 29 May 2002 22:17:23 -0400, pin...@iro.umontreal.ca (François Pinard)
wrote in comp.lang.python in article
<mailman.1022725133...@python.org>:

> [Sheila King]
> > [Anthony Baxter]
> > > [François Pinard]
>
> > > > In my experience, incorrect MIME structure is one of the numerous
> > > > hints about mail being SPAM. I do not remember a single false positive.
>
> > > I wish. I have to deal with end-user email, and trust me, it's not all
> > > spam.
>
> > I concur with Anthony. I have written an email filter package using the
> > email module and if you use the strict Parser class included in that
> > module, it does throw away too much good email (because any good mail
> > thrown away is too much).
>
> Maybe the `email' package is stricter than the various MIME processing
> tools that were in Python 1.5.2 and still exist in more recent versions,
> but I would be tempted to think they are of comparable strictness. I do
> not really know.

Well, no, the email package appears to be stricter. I've used code that had
the old 1.5.2 parser in it, and it "broke" noticeably with the new email
module, due to the strictness.

> The proverb says that "alike people get together"; it might explain why
> I do not see more problems: most of my correspondents have mailer agents
> which do a fair job at MIME generation. And when MIME mistakes happen,
> it is usually sufficient to raise the subject with my correspondents,
> who are usually happy to get the problem solved at their end.
>
> Often (but not necessarily), badly structured messages come from people
> who do not care much. Otherwise, they would have set themselves up better.
> As I much prefer people who care, from my viewpoint, there is a significant
> correlation between a message being MIME-erroneous and a message not being
> worth much interest.

??? I don't understand the point of this? I had an email message I received
from the razor mailing list that couldn't be parsed by the email module.
Now that is a list for people who care very much about email and preventing
spam. So...?

And, if I get emails that don't parse, I should do what? Change my circle
of email friends? Sorry, but your points above are lost on me.

> > Moreover, as I've mentioned in other posts and email correspondence,
> > if you're writing software for end users, you really can't just
> > tell them: "Oh, all those mails that caused errors...they were just
> > non-RFC compliant. Probably SPAM or virus."
>
> If you are writing filters for everybody, you are probably right. When I
> write filters for my friends or for myself, in my experience, careless
> MIME may be filtered out as SPAM, and we do not lose much in practice :-).

I have had a different experience than you, as I've pointed out. Most of it
is spam, but some is not.

> > Secondly, why is it that the three other mail readers I use (Agent,
> > Pegasus, and PocoMail) are all able to parse these messages? I also
> > agree with the idea that applications must be strict in what they write
> > and liberal in what they accept.
>
> This is a good principle, but only when kept within reasonable bounds.
> Users should be on the side of being strict, and applications should be on
> the side of being liberal. Users might suffer uselessly by being overly
> ascetic, applications might miss their goal through unlimited friendliness.

I agree.

> For example, I expect compilers to raise diagnostics and help me at being
> strict, because being overly liberal for a compiler is just not helpful.
> Another example, a sad one, is the messy state of HTML all around us,
> it comes from browsers having been by far too liberal, and for too long.

> If mailer agents are very lenient to MIME mis-formatting, they actively
> prevent progress. They do not really help it, as they trigger confusion.
> Moreover, by implementing MIME poorly, they throw discredit on a good idea.
> MIME standards are not that hard to read, you know. It is a mystery to
> me why some mail agents mangle the MIME they generate, or fail to assemble
> it conveniently, in the spirit of the standards, at presentation time.

I agree that there is a point where one can go too far. However, I don't
think that the email module is in any danger of that. I've seen quite a few
articles posted here in this newsgroup from people who are having
difficulty parsing emails with that module. It wouldn't be a bad idea to
make it better able to handle some of the offending emails that it
currently cannot handle.

> > I have written a "smart parser" class that I am using in my email
> > filter. I use this class instead of the Parser class provided with the
> > email module. I provide the code below for all interested parties.
> > [...] Code follows the signature. Enjoy,
>
> I'm saving it for possible later use! Thanks for providing this...

You are welcome. So far as I can tell from reading a few of the messages on
the mimelib developers list, it looks like others who are also interested
in this "problem" will possibly come up with an even better solution in the
not too distant future.

I'm keeping my fingers crossed.

Dmitri I GOULIAEV

May 30, 2002, 3:24:35 AM

Hi, Sheila King!

On Wed, May 29, 2002 at 11:20:15PM -0700, Sheila King wrote:

> (Francois Pinard)
> > [Sheila King]

> > Often (but not necessarily), badly structured messages come from people
> > who do not care much. Otherwise, they would have set themselves up better.
> > As I much prefer people who care, from my viewpoint, there is a significant
> > correlation between a message being MIME-erroneous and a message not being
> > worth much interest.
>
> ??? I don't understand the point of this? I had an email message I received
> from the razor mailing list that couldn't be parsed by the email module.
> Now that is a list for people who care very much about email and preventing
> spam. So...?

They should care a little bit more? <wink>

Best regards,

--
DIG (Dmitri I GOULIAEV)

All below this line is added by my e-mail provider.

Kragen Sitaker

Jun 1, 2002, 2:17:33 AM

pin...@iro.umontreal.ca (François Pinard) writes:
> For example, I expect compilers to raise diagnostics and help me at being
> strict, because being overly liberal for a compiler is just not helpful.
> Another example, a sad one, is the messy state of HTML all around us,
> it comes from browsers having been by far too liberal, and for too long.

If you had designed the early web browsers, the web never would have
caught on, as indeed many other networked hypertext systems predating
the Web did not; being liberal in what you accept is a crucial
principle in building decentralized systems, and its embodiment in the
Web made the Web possible.

Having "bad HTML" warnings is, of course, extremely helpful. It makes
it possible to be conservative in what you send.