python-list@python.org

MRAB

Jan 14, 2014, 12:00:48 PM

to pytho...@python.org

On 2014-01-14 16:37, Florian Lindner wrote:
> Hello!
>
> I'm using python 3.2.3 on debian wheezy. My script is called from my mail delivery agent (MDA) maildrop (like procmail) through its xfilter directive.
>
> Script works fine when used interactively, e.g. ./script.py < testmail but when called from maildrop it's producing an infamous UnicodeDecodeError:
>
> File "/home/flindner/flofify.py", line 171, in main
> mail = sys.stdin.read()
> File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
> return codecs.ascii_decode(input, self.errors)[0]
>
> Exception for example is always like
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 869: ordinal not in range(128)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1176: ordinal not in range(128)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8c in position 846: ordinal not in range(128)
>
> I read mail from stdin "mail = sys.stdin.read()"
>
> Environment when called is:
>
> locale.getpreferredencoding(): ANSI_X3.4-1968
> environ["LANG"]: C
>
> System environment when using shell is:
>
> ~ % echo $LANG
> en_US.UTF-8
>
> As far as I know when reading from stdin I don't need a decode(...) call, since stdin has a decoding. I also tried some decoding/encoding stuff but that changed nothing.
>
> Any ideas to help me?
>
When run from maildrop it thinks that the encoding of stdin is ASCII.

Peter Otten

Jan 14, 2014, 12:43:57 PM

to pytho...@python.org

I know nothing about maildrop, but found

> add "import LANG" to .maildropfilter.

in this thread:

<http://courier-mail-server.10983.n7.nabble.com/Maildrop-behaviour-change-td18610.html>

Florian Lindner

Jan 14, 2014, 8:25:34 PM

to pytho...@python.org

On Tuesday, 14 January 2014, 17:00:48, MRAB wrote:

> When run from maildrop it thinks that the encoding of stdin is ASCII.

Well, true. But what encoding does maildrop actually give me? It obviously does not inherit LANG, or it is called from the MTA that way. I also tried:

inData = codecs.getreader('utf-8')(sys.stdin)
mail = inData.read()

Failed also. But I'm not exactly an encoding expert.

Regards,
Florian

MRAB

Jan 14, 2014, 10:49:26 PM

to pytho...@python.org

locale.getpreferredencoding() said "ANSI_X3.4-1968", which is ASCII
(ask Wikipedia if you want to know why it's called that!).
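
That alias can be checked from Python itself. A minimal sketch (the variable name is only for illustration): the codec registry resolves the name the C locale reports to the plain ASCII codec.

```python
import codecs

# "ANSI_X3.4-1968" is the name reported under the C locale; Python's
# codec registry treats it as an alias for ASCII.
codec_name = codecs.lookup("ANSI_X3.4-1968").name
print(codec_name)
```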

> inData = codecs.getreader('utf-8')(sys.stdin)
> mail = inData.read()
>
> Failed also. But I'm not exactly an encoding expert.
>

Try:

sys.stdin = codecs.getreader('utf-8')(sys.stdin.detach())
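
The difference from the earlier attempt is the detach() call. sys.stdin is already a text stream that decodes as ASCII, so wrapping it in another reader most likely still hits the ASCII decoder first; detach() returns the underlying byte stream instead. A minimal sketch of the idea, using an in-memory stand-in for stdin (the helper name reopen_as_utf8 and the sample text are made up for illustration, and UTF-8 is an assumption about what the mail actually contains):

```python
import codecs
import io


def reopen_as_utf8(text_stream):
    """Return a UTF-8 reader over the byte stream beneath text_stream."""
    # detach() hands back the underlying binary buffer and leaves the
    # original text wrapper unusable, so only the new reader is kept.
    return codecs.getreader("utf-8")(text_stream.detach())


# Stand-in for sys.stdin as seen under maildrop: UTF-8 bytes behind a
# text wrapper that believes the data is ASCII.
payload = "Gr\u00fc\u00dfe aus Berlin\n".encode("utf-8")
fake_stdin = io.TextIOWrapper(io.BytesIO(payload), encoding="ascii")

reader = reopen_as_utf8(fake_stdin)
mail = reader.read()
print(ascii(mail))  # ascii() keeps the output safe on an ASCII-only stdout
```

In the real script this has to happen before anything is read from sys.stdin.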

Steven D'Aprano

Jan 15, 2014, 7:38:21 PM

to

On Wed, 15 Jan 2014 02:25:34 +0100, Florian Lindner wrote:

> On Tuesday, 14 January 2014, 17:00:48, MRAB wrote:
>> On 2014-01-14 16:37, Florian Lindner wrote:
>> > Hello!
>> >
>> > I'm using python 3.2.3 on debian wheezy. My script is called from my
>> > mail delivery agent (MDA) maildrop (like procmail) through its
>> > xfilter directive.
>> >
>> > Script works fine when used interactively, e.g. ./script.py <
>> > testmail but when called from maildrop it's producing an infamous
>> > UnicodeDecodeError:

What's maildrop? When using third party libraries, it's often helpful to
give some detail on what they are and where they are from.

>> > File "/home/flindner/flofify.py", line 171, in main
>> > mail = sys.stdin.read()

What's the value of sys.stdin? If you call this from your script:

print(sys.stdin)

what do you get? Is it possible that the mysterious maildrop is messing
stdin up?
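
Something along these lines would collect that information in one go. It is only a sketch: the helper name describe_stream is made up, and the report goes to stderr on the assumption that stdout is what maildrop reads back as the filtered message.

```python
import io
import locale
import os
import sys


def describe_stream(stream):
    """Collect the settings that decide how a text stream is decoded."""
    return {
        "repr": repr(stream),
        "encoding": getattr(stream, "encoding", None),
        "errors": getattr(stream, "errors", None),
        "preferred": locale.getpreferredencoding(),
        "LANG": os.environ.get("LANG"),
    }


# In the real script, report on sys.stdin itself.
if sys.stdin is not None:
    for key, value in describe_stream(sys.stdin).items():
        print(key, "=", value, file=sys.stderr)

# Stand-in stream showing what an ASCII-configured stdin looks like.
sample = describe_stream(io.TextIOWrapper(io.BytesIO(b""), encoding="ascii"))
```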

>> > File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
>> > return codecs.ascii_decode(input, self.errors)[0]
>> >
>> > Exception for example is always like
>> >
>> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position
>> > 869: ordinal not in range(128)

That makes perfect sense: byte 0x82 is not in the ASCII range. ASCII is
limited to byte values 0 through 127, and 0x82 is hex for 130. So the
error message is telling you *exactly* what the problem is: your email
contains a non-ASCII byte, with value 0x82.
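
The same failure can be reproduced with a single byte, without any mail or maildrop involved (a small sketch; the variable names are arbitrary):

```python
raw = b"\x82"

# Indexing a bytes object gives an int: 0x82 is 130, outside 0..127.
value = raw[0]

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```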

How can you deal with this?

(1) "Oh gods, I can't deal with this, I wish the whole world was America
in 1965 (except even back then, there were English characters in common
use that can't be represented in ASCII)! I'm going to just drop anything
that isn't ASCII and hope it doesn't mangle the message *too* badly!"

You need to set the error handler to 'ignore'. How you do that may depend
on whether or not maildrop is monkeypatching stdin.

(2) "Likewise, but instead of dropping the offending bytes, I'll replace
them with something that makes it obvious that an error has occurred."

Set the error handler to "replace". You'll still mangle the email, but it
will be more obvious that you mangled it.

(3) "ASCII? Why am I trying to read email as ASCII? That's not right.
Email can contain arbitrary bytes, and is not limited to pure ASCII. I
need to work out which encoding the email is using, but even that is not
enough, since emails sometimes contain the wrong encoding information or
invalid bytes. Especially spam, that's particularly poor. (What a
surprise, that spammers don't bother to spend the time to get their code
right?) Hmmm... maybe I ought to use an email library that actually gets
these issues *right*?"
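
To make the difference between the first two options concrete, here is a small sketch on a byte string. The sample bytes are an assumption (UTF-8 text with two accented letters), not taken from the failing mail.

```python
# UTF-8 encoding of "naive cafe" with a diaeresis on the i and an
# acute accent on the e.
raw = b"na\xc3\xafve caf\xc3\xa9"

# Option (1): silently drop every byte that is not ASCII.
dropped = raw.decode("ascii", errors="ignore")

# Option (2): replace each offending byte with U+FFFD so the damage shows.
marked = raw.decode("ascii", errors="replace")

# For comparison: decoding with the encoding the bytes were written in.
correct = raw.decode("utf-8")

print(ascii(dropped))
print(ascii(marked))
print(ascii(correct))
```

Both handlers lose information; only decoding with the right encoding gets the text back.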

What does the maildrop documentation say about encodings and/or malformed
email?

>> > I read mail from stdin "mail = sys.stdin.read()"
>> >
>> > Environment when called is:
>> >
>> > locale.getpreferredencoding(): ANSI_X3.4-1968
>> > environ["LANG"]: C

For a modern Linux system to be using the C encoding is not a good sign.
It's not 1970 anymore. I would expect it should be using UTF-8. But I
don't think that's relevant to your problem (although a mis-configured
system may make it worse).

>> > System environment when using shell is:
>> >
>> > ~ % echo $LANG
>> > en_US.UTF-8

That's looking more promising.

>> > As far as I know when reading from stdin I don't need an decode(...)
>> > call, since stdin has a decoding.

That depends on what stdin actually is. Please print it and show us.

Also, can you do a visual inspection of the email that is failing? If
it's spam, perhaps you can just drop it from the queue and deal with this
issue later.

>> > I also tried some decoding/encoding
>> > stuff but changed nothing.

Ah, but did you try the right stuff? (Randomly perturbing your code in
the hope that the error will go away is not a winning strategy.)

>> > Any ideas to help me?
>> >
>> When run from maildrop it thinks that the encoding of stdin is ASCII.
>
> Well, true. But what encoding does maildrop actually gives me? It
> obviously does not inherit LANG or is called from the MTA that way.

Who knows? What's maildrop? What does its documentation say about
encodings? The fact that it is using ASCII apparently by default does not
give me confidence that it knows how to deal with 8-bit emails, but I
might be completely wrong.

> I also tried:
>
> inData = codecs.getreader('utf-8')(sys.stdin)
> mail = inData.read()
>
> Failed also. But I'm not exactly an encoding expert.

Failed how? Please copy and paste your exact exception traceback, in full.

Ultimately, dealing with email is a hard problem. So long as you only
receive 7-bit ASCII mail, you don't realise how hard it is. But the
people who write the mail libraries -- at least the good ones -- know
just how hard it really is. You can have 8-bit emails with no encoding
set, or the wrong encoding, or the right encoding but the contents then
includes invalid bytes. It's not just spammers who get it wrong;
legitimate programmers sending email also screw up.

Email is worse than the 90/10 rule. 90% of the effort is needed to deal
with 1% of the emails. (More if you have a really bad spam problem.) You
should look at a good email library, like the one in the std lib which I
believe gets most of these issues right.
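
As a rough sketch of that approach: read the message as bytes and let the email package work out the charset from the headers. The helper names and the sample message are invented for illustration, and the policy argument assumes a newer Python than 3.2 (email.policy arrived in 3.3; email.message_from_bytes itself exists in 3.2).

```python
import email
import email.policy
import sys


def parse_mail(raw_bytes):
    """Parse a raw message; headers and body are decoded per their charset."""
    return email.message_from_bytes(raw_bytes, policy=email.policy.default)


def read_mail_from_stdin():
    # sys.stdin.buffer is the byte stream, so the locale never gets a say.
    return parse_mail(sys.stdin.buffer.read())


# Made-up sample message with an 8-bit UTF-8 body.
raw = (
    b"Subject: Test\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n" + "Gr\u00fc\u00dfe\r\n".encode("utf-8")
)
msg = parse_mail(raw)
subject = msg["Subject"]
body = msg.get_content()
```

Reading bytes sidesteps the stdin encoding question entirely, which matters because a mail's charset is declared per message (or per part), not per process.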

--
Steven

Ben Finney

Jan 15, 2014, 7:52:41 PM

to pytho...@python.org

Steven D'Aprano <steve+comp....@pearwood.info> writes:

> On Wed, 15 Jan 2014 02:25:34 +0100, Florian Lindner wrote:
> >> On 2014-01-14 16:37, Florian Lindner wrote:
> >> > I'm using python 3.2.3 on debian wheezy. My script is called from
> >> > my mail delivery agent (MDA) maildrop (like procmail) through
> >> > its xfilter directive.
> >> >
> >> > Script works fine when used interactively, e.g. ./script.py <
> >> > testmail but when called from maildrop it's producing an infamous
> >> > UnicodeDecodeError:
>
> What's maildrop? When using third party libraries, it's often helpful to
> give some detail on what they are and where they are from.

It's not a library; as he says, it's an MDA program. It is from the
Courier mail application <URL:http://www.courier-mta.org/maildrop/>.

From that, I understand Florian to be saying his Python program is
invoked via command-line from some configuration directive for Maildrop.

> What does the maildrop documentation say about encodings and/or
> malformed email?

I think this is the more likely line of enquiry to diagnose the problem.

> For a modern Linux system to be using the C encoding is not a good
> sign.

That's true, but it's likely a configuration problem: the encoding needs
to be set *and* obeyed at an administrative and user-profile level.

> It's not 1970 anymore. I would expect it should be using UTF-8. But I
> don't think that's relevant to your problem (although a mis-configured
> system may make it worse).

Since the MDA usually runs at a user-specific level rather than as a
system service, I would expect the problem to be some interaction between
the host locale and the user-specific locale.

> Who knows? What's maildrop? What does its documentation say about
> encodings?

I hope the original poster enjoys manpages, since that's how the program
is documented <URL:http://www.courier-mta.org/maildrop/documentation.html>.

> The fact that it is using ASCII apparently by default does not give me
> confidence that it knows how to deal with 8-bit emails, but I might be
> completely wrong.

I've found that the problem is often that Python is the party assuming
that stdin and stdout are ASCII, largely because it hasn't been told
otherwise.
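
One way of telling it otherwise is the PYTHONIOENCODING environment variable, which Python consults at startup when setting up stdin, stdout and stderr. The sketch below starts a child interpreter with that variable set and asks it what it thinks stdin's encoding is. The helper name is made up, and how to set the variable in maildrop's own configuration is left open here. Note that later Python versions (3.7 and up) may already switch to UTF-8 under the C locale, so the original symptom is easiest to reproduce on older releases.

```python
import os
import subprocess
import sys


def child_stdin_encoding(extra_env):
    """Start a fresh interpreter and report its sys.stdin.encoding."""
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdin.encoding)"],
        input=b"",
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


# Force the stdio encoding regardless of what the locale says.
forced = child_stdin_encoding({"PYTHONIOENCODING": "utf-8"})
print(forced)
```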

--
\ “The greatest tragedy in mankind's entire history may be the |
`\ hijacking of morality by religion.” —Arthur C. Clarke, 1991 |
_o__) |
Ben Finney